[HN Gopher] DALL·E: Creating Images from Text
___________________________________________________________________
DALL·E: Creating Images from Text
Author : todsacerdoti
Score : 363 points
Date : 2021-01-05 19:08 UTC (3 hours ago)
(HTM) web link (openai.com)
(TXT) w3m dump (openai.com)
| anthk wrote:
| There was a programming language akin to POV-Ray (though not
| for raytracing) for describing a scene with commands like
| "place a solid here" and so on.
|
| I can't remember its name.
|
| EDIT:
|
| https://www.contextfreeart.org/
| scribu wrote:
| There are several projects like this, but they can only
| generate abstract shapes, i.e. they're much lower-level than a
| natural language caption.
| gfody wrote:
| this is incredible, but I can't help but feel like we're
| skipping some important steps by working with plain text and
| bitmaps - "a collection of glasses sitting on a table" is
| sometimes eyeglasses, sometimes drinking glasses, sometimes a
| weird amalgamation. and as long as we're OK with ambiguity in
| every layer, are we really ever going to be able to
| meaningfully interrogate the reasons behind a particular
| output?
| ludwigschubert wrote:
| That's a delightful result you all; and beautifully explored,
| too!
| saberience wrote:
| Maybe I'm missing something but does it say what library of
| images was used to train this model? I couldn't quite understand
| the process of building DALL-E. Did they have a large database
| of labeled images that they combined with GPT-3?
| jcims wrote:
| They are going to train on YouTube/PornHub before long and it's
| going to get weird.
| irrational wrote:
| Combine this with deep fakes.
|
| Donald Trump is Nancy Pelosi's and AOC's step-brother in a
| three-way in the Lincoln Bedroom.
| ve55 wrote:
| I really do think AI is going to replace millions of workers very
| quickly, but just not in the order that we used to think of. We
| will replace jobs that require creativity and talent before we
| will replace most manual factory workers, as hardware is
| significantly more difficult to scale up and invent than
| software.
|
| At this point I have replaced a significant number of creative
| workers with AI for personal use, for example:
|
| - I use desktop backgrounds generated by VAEs (VD-VAE)
|
| - I use avatars generated by GANs (StyleGAN, BigGAN)
|
| - I use and have fun with written content generated by
| transformers (GPT3)
|
| - I listen to and enjoy music and audio generated by autoencoders
| (Jukebox, Magenta project, many others)
|
| - I don't purchase stock images or commission artists for many
| things I previously would have, when a GAN already exists that
| makes the class of image I want
|
| All of this has happened in the last year or so for me, and I
| expect that within a few more years this will be the case for
| vastly more people and in a growing number of domains.
| ErikAugust wrote:
| Isn't training data effectively a form of sampling?
|
| Couldn't any creator of images that a model was trained on sue
| for copyright infringement?
|
| Or do great artists really just steal (just at a massive
| scale)?
| ve55 wrote:
| Currently that is not the case:
|
| >Models in general are generally considered "transformative
| works" and the copyright owners of whatever data the model
| was trained on have no copyright on the model. (The fact that
| the datasets or inputs are copyrighted is irrelevant, as
| training on them is universally considered fair use and
| transformative, similar to artists or search engines; see the
| further reading.) The model is copyrighted to whomever
| created it.
|
| Source (scroll up slightly past where it takes you):
| https://www.gwern.net/Faces#copyright
| ErikAugust wrote:
| Thank you, this is the part I find most relevant:
|
| "Models in general are generally considered "transformative
| works" and the copyright owners of whatever data the model
| was trained on have no copyright on the model. (The fact
| that the datasets or inputs are copyrighted is irrelevant,
| as training on them is universally considered fair use and
| transformative, similar to artists or search engines; see
| the further reading.) The model is copyrighted to whomever
| created it. Hence, Nvidia has copyright on the models it
| created but I have copyright on the models I trained
| (which I release under CC-0)."
| visarga wrote:
| I bet they can claim copyright up to the gradients
| generated on their media, but in the end the gradients get
| summed up, so their contribution is lost in the cocktail.
|
| If I write a copyrighted text in a book, then print a million
| other texts on top of it, in both white and black, mixing it
| all up until it looks like white noise, would the original
| authors have a claim?
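|
| In SGD terms (a toy numpy sketch, not the actual training
| code): each example's gradient is one term in a mean over
| millions, so the final weights carry only a vanishing trace of
| any single input.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     n_examples, n_params = 1_000_000, 8
|
|     # Toy per-example "gradients"; the weight update is their
|     # mean, so any single example contributes ~1/n of the step.
|     grads = rng.normal(size=(n_examples, n_params))
|     update = grads.mean(axis=0)
|     share = np.abs(grads[0]) / np.abs(grads).sum(axis=0)
|     print(share.max())  # ~1e-6: lost in the cocktail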
| yowlingcat wrote:
| > We will replace jobs that require creativity
|
| Frankly, I think the "AI will replace jobs that require X"
| angle of automation is borderline apocalyptic conspiracy porn.
| It's always phrased as if the automation simply stops at making
| certain jobs redundant. It's never phrased as if the automation
| lowers the bar to entry from X to Y for /everyone/, which
| floods the market with crap and makes people crave the good
| stuff made by the top 20%. Why isn't it considered as likely
| that this kind of technology will simply make the best 20% of
| creators exponentially more creatively prolific in quantity and
| quality?
| ve55 wrote:
| > Why isn't it considered as likely that this kind of
| technology will simply make the best 20% of creators
| exponentially more creatively prolific in quantity and
| quality?
|
| I think that's well within the space of reasonable
| conclusions. As we get better at generating content/art, we
| also get better at assisting humans in generating it, so it's
| possible that pathway ends up becoming much more common.
| rich_sasha wrote:
| Not to undermine this development, but so far, no surprise, AI
| depends on vast quantities of human-generated data. This leads
| us to a loop: if AI replaces human creativity, who will create
| novel content for the next generation of AI? Will AI also learn
| to break through conventions, to shock and rewrite the rules of
| the game?
|
| It's like efficient market hypothesis: markets are efficient
| because arbitrage, which is highly profitable, makes them so.
| But if they are efficient, how can arbitrageurs afford to stay
| in business? In practice, we are stuck in a half-way house,
| where markets are very, but not perfectly, efficient.
|
| I guess in practice, the pie for humans will keep on shrinking,
| but won't disappear too soon. Same as horse maintenance
| industry, farming and manufacturing, domestic work etc. Humans
| are still needed there, just a lot less of them.
| p1esk wrote:
| _if AI replaces human creativity, who will create novel
| content for the next generation of AI?_
|
| The vast majority of human-generated content is not very novel
| or creative. I'm guessing less than 1% of professional human
| writers or composers create something original. Those people
| are not in any danger of being replaced by AI, and will probably
| be earning more money as a result of more value being placed
| on originality of content. Humans will strive (or be forced)
| to be more creative, because all non-original content
| creation will be automated. It's a win-win situation.
| ve55 wrote:
| > Will AI also learn to break through conventions, to shock
| and rewrite the rules of the game?
|
| I think AlphaGo was a great in-domain example of this. I
| definitely see things I'd colloquially call 'creativity' in
| this DALL-E post (you can decide for yourself), though that
| still isn't claiming it matches what some humans can do.
| rich_sasha wrote:
| True, but AlphaGo exists in a world where everything is
| absolute. There are new ways of playing Go, but the same
| rules.
|
| If I train an AI on classical paintings, can it ever invent
| Impressionism, Cubism, Surrealism? Can it do irony? Can it
| come up with something altogether new? Can it do meta?
| "AlphaPaint, a recursive self-portrait"?
|
| Maybe. I'm just not sure we have seen anything in this
| dimension yet.
| ve55 wrote:
| >If I train an AI on classical paintings, can it ever
| invent Impressionism, Cubism, Surrealism?
|
| I see your point, but it's an unfair comparison: if you
| put a human in a room and never showed them anything
| except classical paintings, it's unlikely they would
| quickly invent cubism either. The humans that invented
| new art styles had seen so many things throughout their
| life that they had a lot of data to go off of.
| Regardless, I think we can already do enough neural style
| transfer to invent new styles of art.
| ryan93 wrote:
| Those don't seem in any way similar to like writing a tv show
| or animating a Pixar movie.
| ve55 wrote:
| I agree, and due to the amount of compute required for those
| types of works, I think they are still quite a while away.
|
| But the creative professions consist of much more than
| highly-paid, well-credentialed individuals working at
| well-known US corporations. There are millions of artists
| that just do quick illustrations, logos, sketches, and so on,
| on a variety of services, and they will be replaced far
| before Pixar is.
| [deleted]
| ignoranceprior wrote:
| Do you think investing in MSFT/GOOGL is the best way to profit
| off this revolution?
| ve55 wrote:
| It's too hard to say I think. Big players will definitely
| benefit a lot, so it probably isn't a _bad_ idea, but if you
| could find the right startups or funds, you might be able to
| get significantly more of a return.
| karmasimida wrote:
| I think this is actually not a bad thing.
|
| I won't say many of those things are creativity-driven. They
| are more like automated asset generation.
|
| One use case for such a model would be in the gaming industry,
| to generate large amounts of assets quickly. This process alone
| takes years, and gets more and more expensive as gamers demand
| higher and higher resolutions.
|
| AI can make this process much more tenable, bringing down the
| overall cost.
| sushisource wrote:
| > - I use and have fun with written content generated by
| transformers (GPT3)
|
| > - I listen to and enjoy music and audio generated by
| autoencoders (Jukebox, Magenta project, many others)
|
| Really, you've "replaced" normal music and books with these?
| Somehow I doubt that.
| notJim wrote:
| What are you talking about, this is my favorite album:
| https://www.youtube.com/watch?v=K0t6ecmMbjQ
| ve55 wrote:
| Not entirely, no - I hope I didn't imply that. I listen to
| human-created music every day. I just mean to say that I've
| also listened to AI-created music that I've enjoyed, so it's
| gone from being 0% of what I listen to to 5%, and presumably
| may increase much more later.
| p1esk wrote:
| You should try Aiva (http://aiva.ai). At some point I was
| mostly listening to compositions I generated through that
| platform. Now I'm back to Spotify, but AI music is
| definitely on my radar.
| ve55 wrote:
| Looks great, thanks for the suggestion
| [deleted]
| Impossible wrote:
| I believe that AI will accelerate creativity. This will have a
| side effect of devaluing some people's work (like you
| mentioned), but it will also increase the value of some types
| of art and, more importantly, make it possible to do things
| that were impossible before, or allow small teams and
| individuals to produce content that was prohibitively
| expensive.
| [deleted]
| minimaxir wrote:
| There still needs to be some sort of human curation, lest
| bad/rogue output risk sinking the entire AI-generated
| industry. (in the case of DALL-E, OpenAI's new CLIP system is
| intended to mitigate the need for cherry-picking, although from
| the final demo it's still qualitative)
|
| The demo inputs here for DALL-E are curated and utilize a few
| GPT-3 prompt engineering tricks. I suspect that for typical
| unoptimized human requests, DALL-E will go off the rails.
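|
| A minimal sketch of that curation step (hypothetical function
| names standing in for the real generator and CLIP scorer; not
| OpenAI's published API): sample many candidates, score each
| against the caption, keep the best few.
|
|     def generate_images(prompt, n):            # stand-in model
|         return [f"sample_{i}" for i in range(n)]
|
|     def clip_score(prompt, image):             # stand-in scorer
|         return float(hash((prompt, image)) % 1000)
|
|     def rerank(prompt, n=512, keep=32):
|         candidates = generate_images(prompt, n)
|         candidates.sort(key=lambda im: clip_score(prompt, im),
|                         reverse=True)
|         return candidates[:keep]
|
|     best = rerank("a collection of glasses sitting on a table")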
| andybak wrote:
| Personally speaking I don't want curation. What is
| fascinating about generative AI is the failure modes.
|
| I want the stuff that no human being could have made - not
| the things that could pass for genuine works by real people.
| minimaxir wrote:
| Failure modes are fun when they get 80-90% of the way there
| and hit the uncanny valley.
|
| Unfortunately many generations fail to hit that.
| ve55 wrote:
| Yes, but there's no reason we can't partially solve this by
| throwing more data at the models, since we have vast amounts
| of data we can use for that (ratings, reviews, comments,
| etc), and we can always generate more en masse whenever we
| need it.
| minimaxir wrote:
| This isn't a problem that can be solved with more data.
| It's a function of model architecture, and as OpenAI has
| demonstrated, larger models generally perform better even
| if normal people can't run them on consumer hardware.
|
| But there is still a _lot_ of room for more clever
| architectures to get around that limitation. (e.g.
| Shortformer)
| ve55 wrote:
| I think it's both - we have a lot of architectural
| improvements that we can try now and in the future, but I
| don't see why you can't take the output of generative art
| models, have humans rate them, and then use those ratings
| to improve the model such that its future art is likely
| to get a higher rating.
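|
| A toy sketch of that feedback loop (all names hypothetical):
| sample, collect ratings, fine-tune on the top-rated outputs.
|
|     import random
|
|     def human_rating(sample):        # stand-in for real ratings
|         return random.random()
|
|     class ToyModel:                  # stand-in generator
|         def sample(self, prompt): return f"art for {prompt!r}"
|         def finetune(self, batch): pass
|
|     def improve(model, prompts, rounds=3, keep_frac=0.1):
|         for _ in range(rounds):
|             outs = [model.sample(p) for p in prompts]
|             rated = sorted(outs, key=human_rating, reverse=True)
|             model.finetune(rated[:int(len(rated) * keep_frac)])
|         return model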
| A4ET8a8uTh0 wrote:
| You are probably right. Still, there is hope that this is just
| a prelude to getting closer to a Transmetropolitan box
| (assuming we can ever figure out how to make an AI box that can
| make physical items based purely on information given by the
| user).
| captainmuon wrote:
| Maybe I'm cynical, but I'm really skeptical. What if this is just
| some poor image generation code and a few hundred minimum wage
| workers manually assembling the examples? Unless I can feed it
| arbitrary sentences we can never know.
|
| I would be disappointed, but not surprised if OpenAI turns out to
| be the Theranos of AI...
| apatap wrote:
| At least GPT-3 can generate texts much faster than a worker
| would need to create them manually.
| captainmuon wrote:
| Right, but I bet the images shown here were preselected.
| sircastor wrote:
| In various episodes of Star Trek: The Next Generation, the crew
| asks the computer to generate some environment or object from
| relatively little description. It's a storytelling tool of
| course, but looking at this, I can begin to imagine how we might
| get there from here.
| inferense wrote:
| In spite of the close architectural resemblance to VQ-VAE-2, it
| definitely pushes the text-to-image synthesis domain further.
| I'd be curious to see how well it can perform in a multi-object
| image setting, which currently presents a main challenge in the
| field. Also, I wouldn't be surprised if these results were only
| attainable at OpenAI's scale of computing resources. All in all,
| great progress in the field. The pace of development here is
| simply staggering, considering that a few years back we could
| hardly generate any image in high fidelity.
| minimaxir wrote:
| The way this model operates is the equivalent of machine learning
| shitposting.
|
| Broke: Use a text encoder to feed text data to an image
| generator, like a GAN.
|
| Woke: Use a text and image encoder _as the same input_ to decode
| text and images _as the same output_.
|
| And yet, due to the magic of Transformers, it works.
|
| From the technical description, this seems feasible to clone
| given a sufficiently robust dataset of images, although the scope
| of the demo output implies a much more robust dataset than the
| ones Microsoft has offered publicly.
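|
| Concretely, a sketch of the sequence layout described here and
| in the post (256 BPE text tokens followed by 1024
| image-codebook tokens, trained with ordinary next-token
| prediction; the padding scheme is an assumption):
|
|     TEXT_LEN, IMAGE_LEN = 256, 1024
|
|     def make_sequence(text_tokens, image_tokens, pad_id=0):
|         # One flat 1280-token stream: text first, then image.
|         assert len(text_tokens) <= TEXT_LEN
|         assert len(image_tokens) == IMAGE_LEN
|         pad = [pad_id] * (TEXT_LEN - len(text_tokens))
|         return text_tokens + pad + image_tokens
|
|     # Training predicts token t+1 from tokens 0..t, exactly as
|     # in a text-only language model; the modality of a position
|     # is only implicit in the learned embeddings.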
| thunderbird120 wrote:
| It's not really surprising given what we now know about
| autoregressive modeling with transformers. It's essentially a
| game of predict hidden information given visible information.
| As long as the relationship between the visible and hidden
| information is non-random you can train the model to understand
| an amazing amount about the world by literally just predicting
| the next token in a sequence given all the previous ones.
|
| I'm curious whether they do a backward pass here; it would
| probably have value. They seem to describe placing the text
| tokens first, meaning that once you start generating image
| tokens, all the text tokens are visible.
| to generate an image with respect to a prompt but you could
| also literally just reverse the order of the sequence to have
| the model also learn to generate prompts with respect to the
| image. It's not clear if this is happening.
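|
| In sketch form (speculative - the post doesn't say they do
| this), the same data could be serialized both ways:
|
|     def forward_example(text_toks, image_toks):
|         return text_toks + image_toks   # learns text -> image
|
|     def reversed_example(text_toks, image_toks):
|         return image_toks + text_toks   # would learn captioning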
| minimaxir wrote:
| That approach wouldn't work out of the box; it sees text for
| the first 256 tokens and images for the following 1024
| tokens, and tries to predict the same. It likely would not
| have much to go on if you gave it the 1024 image tokens first
| and the 256 text tokens after.
|
| A network optimizing for both use cases (e.g. the training
| set is half 256 + 1024, half 1024 + 256) would _likely_ be
| worse than a model optimizing for one of the use cases, but
| then again models like T5 argue against it.
| nabla9 wrote:
| Shitposts are more creative. What I would like to see is more
| extrapolation and complex mixing:
|
| "A photo of a iPhone from the stone age."
|
| "Adolf Hitler pissing against the wind and enjoying it."
|
| "Painting: Captain Jean-Luc Picard crossing of the Delaware
| River in a Porsche 911".
| Tycho wrote:
| Recently heard a resident machine learning expert describe GPT-3
| as 'not revolutionary in the slightest' or something like that.
| minimaxir wrote:
| It's not revolutionary, just a typical-but-notable iterative
| step in NLP. Which is fine!
|
| I wrote a blog post on that a few months ago after playing a
| bit with GPT-3, and it holds up.
| https://news.ycombinator.com/item?id=23891226
| jokethrowaway wrote:
| It's not, but it showed that we can get an order of magnitude
| better results by adding an order of magnitude more data.
|
| To be honest, it's not where I'd like to see efforts in the
| field go.
|
| Not because I'm afraid of AI taking over, but because I'd
| rather have humans recreate something comparable to a human
| brain (functionality wise).
| visarga wrote:
| Who knows, maybe in a few years you will be amazed at the new
| universal transformer chip that runs on 20W of power and can
| do almost any task. No need for retraining, just speak to it,
| show it what you need. Even prompt engineering has been
| automated (https://arxiv.org/abs/2101.00121) so no more
| mystery. So much for the new hot job of GPT prompt engineer
| that would replace software dev.
| asbund wrote:
| I was skeptical before, but now I'm open to this idea.
| visarga wrote:
| It's revolutionary in costs, and delivers for every dollar
| spent.
| FL33TW00D wrote:
| I think they're correct in saying that GPT-3 isn't
| revolutionary, since it just demonstrates the power of scaling.
| However I would argue that the underlying architecture, the
| Transformer (GP(T)), is/was/will be revolutionary.
| wwarner wrote:
| Similar to Wordseye https://www.wordseye.com/
| thepace wrote:
| WordsEye seems to be about scene generation out of pre-existing
| building blocks, whereas DALL-E is about creating those
| building blocks themselves.
| visarga wrote:
| I'm wondering why the image comes out non-blocky, since
| transformers take slices of the image as input. They say they
| have about 1024 tokens for the image, which would mean a 32x32
| grid of patches. How is it possible that these patches align so
| well along the edges without JPEG-like artifacts?
| minimaxir wrote:
| If you read footnote #2, the source images are 256x256 but
| downsampled using a VAE, and presumably upsampled using a VAE
| for publishing (IIRC they are less prone to the infamous GAN
| artifacts).
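|
| The arithmetic (layout assumed from those numbers) works out to
| an 8x spatial downsampling, which is why the patches don't show
| seams - the VAE decoder, not pixel stitching, produces the
| final image:
|
|     image_hw = 256                 # source image is 256x256 px
|     grid_hw = 32                   # latent grid is 32x32 codes
|     tokens = grid_hw * grid_hw     # 1024 image tokens
|     patch = image_hw // grid_hw    # each code covers 8x8 px
|     print(tokens, patch)           # 1024 8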
| hnthrowopen wrote:
| Is there a link to the git repo or is OpenAI not really open?
| jokethrowaway wrote:
| the only open thing is the name
| wccrawford wrote:
| I suspect you meant for Dall-E specifically, but this is their
| repo. Found on their about page.
|
| https://github.com/openai/
| CyberRabbi wrote:
| Seems like we're getting closer to AI driven software
| engineering.
|
| Prompt: a Windows GUI executable that implements a scientific
| calculator.
| hooande wrote:
| What you'll get is the same thing as GPT-3: the equivalent of
| googling the prompt. You can google "implement a scientific
| calculator" and get multiple tutorials right now.
|
| You'll still need humans to make anything novel or interesting,
| and companies will still need to hire engineers to work on
| valuable problems that are unique to their business.
|
| All of these transformers are essentially trained on "what's
| visible to google", which also defines the upper bound of their
| utility.
| adamredwoods wrote:
| Possibly, but in software the realm of errors is wider and more
| detrimental. With imagery, the human mind will fill in the gaps
| and allow interpretation. With software, not so much.
| visarga wrote:
| True, but the human mind needs an expensive, singleton body
| in the real world, while a code-writing GPT-3 only needs a
| compiler and a CPU to run its creations. Of course they would
| put a million cores to work at 'learning to code' so it would
| go much faster. Compare that with robotics, where it's so
| expensive to run your agents. I think this direction really
| has a shot.
| brian_herman wrote:
| Now we just have to wait for huggingface to create an open
| source implementation. So much for openness, I guess - if you
| go on Microsoft Azure you can use closed AI.
| TravisLS wrote:
| If I put text into this tool and generate an original and unique
| image, who owns that image? If it's OpenAI, do they license it?
| durpkingOP wrote:
| I predict that one day you'll be able to create
| animations/videos with a variation of this: you could define
| characters/scripts/etc., insert a story, and it generates a
| movie.
| nl wrote:
| The "collection of glasses sitting on a table" example is
| excellent.
|
| Some pics are of drinking glasses and some are of eye glasses,
| and one has both.
| adamredwoods wrote:
| I also like the telephones from different eras, including the
| future.
| TedDoesntTalk wrote:
| > an illustration of a baby daikon radish in a tutu walking a dog
|
| Wow!
| dinkleberg wrote:
| RIP to all the fiverr artists out there.
|
| This is impressive.
| asbund wrote:
| This is amazing
| ignoranceprior wrote:
| Does this address NLP skeptics' concerns that Transformer models
| don't "understand" language?
|
| If the AI can actually draw an image of a green block on a red
| block, and vice versa, then it clearly understands something
| about the concepts "red", "green", "block", and "on".
| karmasimida wrote:
| I think it is safe to say that learning a joint distribution of
| vision + language is fully possible at this stage, as
| demonstrated by this work.
|
| But 'understanding' itself needs to be further specified in
| order to even be tested.
|
| What strikes me most is the fidelity of those generated images,
| matching the SOTA from the GAN literature with much more
| variety.
|
| It seems the Transformer might be the best neural construct we
| have right now for learning any distribution, given more than
| enough data.
| TigeriusKirk wrote:
| There are examples on twitter showing it doesn't really
| understand spatial relations very well. Stuff like "red block
| on top of blue block on top of green block" will generate red,
| green, and blue blocks, but not in the desired order.
|
| https://twitter.com/peabody124/status/1346565268538089483
| tralarpa wrote:
| Try a large block on a small block. As the authors have also
| noted in their comments, the success rate is nearly zero. One
| may wonder why - maybe because that's something you rarely see
| in photos? In the end, it doesn't "understand" the meaning of
| the words.
| camdenlock wrote:
| Amazing. Would love to play with this.
|
| Is OpenAI going to offer this as a closed paywalled service? Once
| again wondering how the "open" comes into play.
| [deleted]
| bryanrasmussen wrote:
| I wonder what it makes out of green ideas sleep furiously.
| Marinlemaignan wrote:
| I want to see it go into an infinite loop with image
| recognition software (one where you feed in an image and get a
| written description of it)
| vnjxk wrote:
| I believe it will end up stabilizing on one image, or on a
| sequence of images whose captions reproduce one another
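|
| Something like this (both models hypothetical stand-ins):
|
|     def text_to_image(text):   # stand-in generator
|         return f"<image of {text}>"
|
|     def image_to_text(image):  # stand-in captioner
|         return image[10:-1]
|
|     def fixed_point(prompt, steps=50):
|         seen, text = {}, prompt
|         for step in range(steps):
|             text = image_to_text(text_to_image(text))
|             if text in seen:            # cycle or fixed point
|                 return text, step - seen[text]
|             seen[text] = step
|         return text, None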
| ArtWomb wrote:
| "Teapot in the shape of brain coral" yields the opposite. The
| topology is teapot-esque. The texture composed of coral-like
| appendages. Sorry if this is overly semantic, I just happen to be
| in a deep dive in Shape Analysis at the moment ;)
|
>>> DALL·E appears to relate the shape of a half avocado to the
| back of the chair, and the pit of the avocado to the cushion.
|
| That could be human bias recognizing features the generator
| yields implicitly. Most of the images appear to be "masking" or
| "decal" operations rather than a full style transfer. In other
| words, the expected outcome of "soap dispenser in the shape of
| hibiscus" would resemble a true hybridized design, like an haute
| couture bottle of eau de toilette made to resemble rose petals.
|
| The name DALL-E is terrific though!
| [deleted]
| dane-pgp wrote:
| > a living room with two white armchairs and a painting of the
| colosseum. the painting is mounted above a modern fireplace.
|
| With the ability to construct complex 3D scenes, surely the next
| step would be for it to ingest YouTube videos or TV/movies and be
| able to render entire scenes based on a written narration and
| dialogue.
|
| The results would likely be uncanny or absurd without careful
| human editorial control, but it could lead to some interesting
| short films, or fan-recreations of existing films.
| alpaca128 wrote:
| I'd love to see what this does with item/person/artwork/monster
| descriptions from Dwarf Fortress. Considering the game has
| creatures like were-zebras, beasts in random shapes and
| materials, undead hair, procedurally generated instruments and
| all kinds of artefacts menacing with spikes I imagine it could
| make the whole thing even more entertaining.
| kome wrote:
| black magic
| chishaku wrote:
| ok how can we use this?
___________________________________________________________________
(page generated 2021-01-05 23:00 UTC)