[HN Gopher] How Imagen Works
       ___________________________________________________________________
        
       How Imagen Works
        
       Author : SleekEagle
       Score  : 92 points
       Date   : 2022-06-23 14:45 UTC (8 hours ago)
        
 (HTM) web link (www.assemblyai.com)
 (TXT) w3m dump (www.assemblyai.com)
        
       | [deleted]
        
       | varispeed wrote:
       | > is trained on hundreds of millions of images and their
       | associated captions
       | 
       | So how do you get access to hundreds of millions of images and
       | use them to create derivative works? Did they get consent from
       | millions of authors?
       | 
       | Or is something like that only available to the rich with access
       | to lawyers on tap?
       | 
       | I mean I can imagine if a nobody wanted to do something like
       | this, they'd get bankrupted by having to deal with all the
       | photographers / artists spotting a tiny sliver of their art in
       | the image produced by the model.
       | 
        | Furthermore, would something like this work with music? For
        | instance, train the model on all Spotify songs and then generate
        | songs based on "Get me a Bach symphony played on sticks with
        | someone rapping like Dr Dre with a lisp." Or does the music
        | industry have enough money to bully anyone into not doing that?
        
         | SleekEagle wrote:
         | Presumably Google's terms of service or fair use laws. The real
         | restriction is that, even if you had the dataset, training
         | costs tens of thousands of dollars. Only corporations can
         | really afford to train these things.
         | 
          | Regarding music - audio generation with Diffusion Models (the
          | main component of Imagen and DALL-E 2) has been done, but I'm
          | not sure about music specifically. We will definitely reach
          | the point where most e.g. pop beats can be generated by AI
          | relatively soon.
          | 
          | All a producer has to do is generate 100 beats and select the
          | one s/he likes, potentially interpolating between two or fine-
          | tuning one.
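          | 
          | For illustration only (the sampler below is a hypothetical
          | stand-in, not code from the article): by "interpolating
          | between two" I mean blending the starting noise of two
          | samples, e.g. with spherical interpolation, and decoding the
          | blend:
          | 
          |     import torch
          | 
          |     def slerp(z1, z2, t):
          |         # spherical interpolation of two noise seeds
          |         a, b = z1 / z1.norm(), z2 / z2.norm()
          |         om = torch.acos((a * b).sum().clamp(-1, 1))
          |         return (torch.sin((1 - t) * om) * z1
          |                 + torch.sin(t * om) * z2) / torch.sin(om)
          | 
          |     # two noise seeds that each decode to a beat we like
          |     z_a = torch.randn(1, 1, 128, 512)
          |     z_b = torch.randn(1, 1, 128, 512)
          |     z_mid = slerp(z_a, z_b, 0.5)  # halfway blend
          |     # beat = sampler(z_mid)  # hypothetical diffusion
          |     #                        # sampler / decoder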
        
         | davikr wrote:
         | I've seen an image generated by AI contain an "Alamy" watermark
         | before.
        
       | dubswithus wrote:
       | If Google has something similar or better it definitely makes it
       | look like OpenAI is wasting its time. None of this relates to
       | AGI.
        
         | SleekEagle wrote:
         | I don't think anyone is saying that humanity is close to AGI,
         | but check out DeepMind's Gato work for a more well-rounded
         | agent:
         | 
         | https://www.deepmind.com/publications/a-generalist-agent
        
           | visarga wrote:
           | I think we're past a certain threshold, maybe not AGI but
           | some definite qualitative change is happening.
        
             | SleekEagle wrote:
             | I mean DALL-E 2 was the first time my jaw really hit the
             | floor, although in fairness GPT-3 probably should've done
             | that, but it's easier to do with images.
             | 
              | And then for this to drop just a month later? Insane. It
              | makes you wonder whether they're actually releasing
              | cutting-edge work, or whether Google wrote this paper just
              | because of the publication of DALL-E 2. Maybe they've had
              | this model in the bag for a year.
        
               | alphabetting wrote:
                | Google also released this different text-to-image model
               | yesterday
               | 
               | https://parti.research.google/
               | 
               | I think they've just got a lot of projects going on under
               | the hood and timing was coincidence.
        
               | SleekEagle wrote:
                | Looks cool, although not as good as Imagen.
                | Autoregressive vs. diffusion, I guess.
        
       | Workaccount2 wrote:
        | I have shown Imagen (and DALL-E 2) to a number of people now
        | (non-tech, just everyday friends, family, co-workers) and I
        | have been pretty stunned by the response I get from most people:
       | 
       | "Meh, that's kinda cool? I guess?" or "What am I looking
       | at?"..."Ok? So a computer made it? That seems neat"
       | 
       | To me I am still trying to get my jaw off the floor from 2 months
       | ago. But the responses have been so muted and shoulder shrugging
       | that I think either I am missing something or they are missing
       | something. Even really drilling in, practically shaking them "DO
        | YOU NOT UNDERSTAND THAT THIS IS AN ORIGINAL IMAGE CONSTRUCTED
       | ENTIRELY BY AN AI?!?!" and people just seem to see it as a party
       | trick at best.
        
         | Genbox wrote:
         | I find that most people are primarily driven by a need. You
         | need food? Pick some berries. You need warmth? Start a fire.
         | 
         | When it comes to technology - especially advanced technology
         | like Imagen - people don't see the value because they don't
         | have a need associated with it.
        
         | jazzyjackson wrote:
          | I think if you've been paying attention to the space, this
         | generation of image diffusion is shocking in how quickly it has
         | improved on what we had a year ago.
         | 
         | But if you've never considered that a computer can produce an
          | original image, this is just a new thing computers can do. OTOH
          | I think it's also a lack of imagination about how useful this
          | is: so far the output has been kind of random, so it seems a
          | little gimmicky. Already "Parti" has gotten much closer to
          | letting a user describe exactly what they want in the image,
          | and as people start to see the use cases for them personally,
          | it will hit them that they no longer have to hire someone; they
          | can just type a request into a box.
        
           | SleekEagle wrote:
            | I'm not sure any area of DL has seen a period of more rapid
            | development than Diffusion Models (maybe transformers?). The
            | next few years will be really interesting.
        
         | joshcryer wrote:
         | I've made perhaps overly absolutist statements like "don't you
          | see! this kills artists' jobs!" and it was shrugged off as if I
         | was insane. I probably could've phrased it differently, but to
         | me this is game changing in several fields. Granted, it will
         | open up a new field of "generative artists" but, having played
         | with these things, this is a pretty trivial job, and their
         | training nets are only going to get _better_.
        
           | danielvaughn wrote:
           | To me, it paves the way for creative prototyping. I don't see
           | this as a zero-sum game between artists and AI. Instead, I
           | could see artists using this for some serious time saving,
           | and leveraging that extra time and energy for creating better
           | results.
        
           | Uehreka wrote:
           | I've had a lot of fun playing with Disco Diffusion prompts,
           | but I agree that the people excited about "a generation of
           | prompt artists" are a bit misguided. Soon an AI will emerge
           | that can come up with "better" prompts than you, and the
           | "art" of creating prompts will have a lower skill ceiling.
        
             | [deleted]
        
             | russdill wrote:
              | The GPT algorithms are actually pretty good at making
              | detailed image generation prompts if you ask them to
              | describe in detail the general idea you want.
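              | 
              | As a rough sketch of what I mean (the helper name and
              | prompt wording are made up; this uses the 2022-era OpenAI
              | completion API):
              | 
              |     import openai  # pip install openai
              | 
              |     # Hypothetical helper: expand a rough idea
              |     # into a detailed text-to-image prompt.
              |     def expand_prompt(idea):
              |         task = ("Write a detailed, vivid"
              |                 " text-to-image prompt based on: "
              |                 + idea)
              |         resp = openai.Completion.create(
              |             engine="text-davinci-002",
              |             prompt=task,
              |             max_tokens=100,
              |             temperature=0.8,
              |         )
              |         return resp.choices[0].text.strip()
              | 
              |     print(expand_prompt("a corgi on a skateboard"))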
        
               | SleekEagle wrote:
               | Do you have a link to any papers about this? Would love
               | to check them out
        
             | tsol wrote:
              | Like a neural network just for making prompts that result
              | in aesthetically pleasing Imagen images? And then maybe we
              | can come up with a neural net that can decide which
             | pictures are good and which aren't. Then we can just have
             | robots making art for the sake of consumption solely by
             | robots.
        
           | SleekEagle wrote:
           | It could also be used for more nefarious reasons like
           | disinformation campaigns though... it will be interesting to
           | see what the next few years have in store
        
         | dougmwne wrote:
          | I think I can explain this: for most people, the whole world
          | is basically magic anyway. They don't understand any of the
          | details of how any digital tech works, so they have no
          | framework for which things are impressive and which things
          | are not. They just know that computers can do a great many
          | things that they know nothing about. "Oh, I can bank online?
          | Ok." "Oh, I can have the computer write my book report for me?
          | Ok." "Oh, this McDonald's is fully staffed by sentient robots?
          | Ok."
        
           | endymi0n wrote:
           | I think that hits home.
           | 
            | A lot of people would just answer something along the lines
            | of "Well, they made The Matrix with a computer 20 years
            | ago", and technically that's just as true.
            | 
            | From their remote viewpoint on what's happening in IT, the
            | rest is just an implementation detail.
        
           | GrabbinD33ze69 wrote:
            | A pretty common pattern I've witnessed is that many non-
            | technical people (even people who are tech-savvy but have no
            | CS background do this) assume that a feature which is in
            | reality quite difficult to implement won't take much effort,
            | and vice versa.
        
             | [deleted]
        
             | [deleted]
        
           | mortenjorck wrote:
           | This is the other side of the classic XKCD "Tasks"
           | (https://xkcd.com/1425/).
           | 
           | A non-technical person in 2014 (when the above was originally
           | published) would likely have the same conception of the
           | difficulty of recognizing a bird from an image as they would
           | in 2022, even though the task itself has gone from near-
           | insurmountable to off-the-shelf-library in eight years.
           | 
           | Even as Imagen and Dall-E 2 amaze us today, these feats will
           | likely be commonplace in a few years. The non-technical may
           | have only a vague sense that their new TikTok filter is doing
           | something that was impossible only a few years prior.
        
             | dougmwne wrote:
              | Exactly, and I was thinking of that XKCD. Very much a case
              | in point: I have the Merlin Bird ID app, which can
              | determine species from ridiculously blurry photos and also
             | identify hundreds of birds from their calls alone in noisy
             | environments. In 2014 I would have sworn this would be
             | impossible.
        
         | SleekEagle wrote:
         | I've gotten a lot of "wow, that's cool!"s, which is a pretty
         | fair response for a non-technical person if you ask me!
        
         | thruuavay wrote:
         | Well, I'm still in awe that I have a bunch of walls around me
         | and can cover my body with clothes, or that I'm still alive
         | after all this time, and that I can even rest most of the day
         | and not spend body energy running after or from animals.
         | Amazing stuff.
         | 
         | A program that transforms text to an image? Huh.
        
         | Wistar wrote:
         | I haven't gotten such dismissive responses, but probably only
         | because those I'm inclined to share such things with are the
         | exact kinds of people who'd be blown away by them, and
         | immediately grasp the significance.
        
         | ja3k wrote:
          | I couldn't convince my mother-in-law that it was more
          | impressive than Photoshop.
        
         | trention wrote:
         | It's just an illustration of the fact that the average person
         | doesn't give a sh*t about AI "art" and that it will have ~zero
         | cost and ~zero value.
        
           | bergenty wrote:
           | With the amount of context awareness this AI has, there's
           | nothing all that special about human "art" to be honest.
        
             | trention wrote:
             | I am willing to bet that the revenue from AI-generated
             | "art" will be smaller than the revenue from human-generated
             | art in 5 years (or even 10 years) despite the former
             | probably being at least 2 orders of magnitude higher in
             | volume. This is basic supply and demand + acknowledging the
             | fact that humans don't care about AI "achievements".
        
               | bergenty wrote:
               | AI achievements will be indistinguishable from human
               | achievements. Humans will try to pass off AI achievements
               | as their own. The line will become so blurred that it
               | will be impossible to tell the difference.
        
               | trention wrote:
               | If that happens, all art will simply have no value and
               | art as % of GDP will plummet.
               | 
               | Incidentally, this hasn't happened in areas where AI
               | already dominates like chess and go. Magnus Carlsen alone
               | probably generates more "revenue" than all chess AIs
               | combined.
        
           | phailhaus wrote:
           | Treating Imagen as just an "AI art generator" is extremely
           | short sighted. Sure, you could just try to sell the outputs
           | directly. But the real value is using it to supplement larger
           | works. No need for a stock photo subscription service if you
           | can just generate them automatically. Don't need artists to
           | create textures for your simple games. I can spin up a merch
           | shop powered entirely by AI art and nobody would know. The
           | marginal cost of creation is approaching zero.
        
             | SleekEagle wrote:
              | And perhaps even more interestingly, these things not only
             | exist but there is competition in this space! Essentially
             | unregulated competition as well (and likely for the next 10
             | years). The cost will be driven into the ground.
        
           | Miraste wrote:
           | The apocryphal Henry Ford quote about the average person
           | wanting better horses comes to mind. People off the street
           | have no concept of the impact this tech and the methods
           | behind it will have. Sure, no one is going to be printing
           | these and hanging them in museums. Very few artists support
           | themselves that way, though. The people diffusion models are
           | coming for are the graphic designers, the concept artists,
           | the marketers, and everyone else with a copy of Photoshop and
           | a Getty subscription. GPT-3 is amazing, but it's also not
           | good enough to be useful. Imagen is industry-destroying.
        
             | trention wrote:
              | Although I agree that a somewhat less extreme version of
              | that will happen in the course of this decade, barring a
              | legal decision to prohibit using those models, that won't
             | translate to comparable revenues. The companies providing
             | those services will struggle to make even 10% of the
             | salaries of the displaced workers in revenue. In fact, this
             | will probably be a GDP-destroying (though not value-
             | destroying) application of technology.
        
               | SleekEagle wrote:
               | It's not about generating more revenue, it's about
               | cutting costs. Any company that employs graphic designers
               | etc. will be able to cut 90% of the staff.
               | 
                | Video game companies that need concept art? How about 1
                | guy/gal with Imagen generating baselines and then
                | curating/tailoring as necessary, instead of a team of 5.
        
               | trention wrote:
               | That has nothing to do with anything I wrote. And doesn't
               | contradict it actually.
               | 
               | Saved costs will not translate to higher margins for
               | those that cut them because all competitors will be able
               | to slash them as well, resulting in lower prices across
               | the board.
        
         | monkeybutton wrote:
          | Perhaps it's the combination of AI being so overhyped to the
          | general public, plus media that's already inundated with CGI,
          | that it just doesn't blow them away?
        
         | clircle wrote:
        | People don't care because all their text-to-image needs are well
         | covered by Google Images.
        
       | skinner_ wrote:
       | > The central intuition in using T5 is that extremely large
       | language models, by virtue of their sheer size alone, may still
       | learn useful representations despite the fact that they are not
       | explicitly trained with any text/image task in mind. [...]
       | Therefore, the central question being addressed by this choice is
       | whether or not a massive language model trained on a massive
       | dataset independent of the task of image generation is a
       | worthwhile trade-off for a non-specialized text encoder. The
       | Imagen authors bet on the side of the large language model, and
       | it is a bet that seems to pay off well.
       | 
        | One way out of this dilemma would be to fine-tune T5 on the
        | caption dataset instead of keeping it frozen. The paper notes
        | that they don't do fine-tuning, but it does not provide any
        | ablation or other justification. I wonder whether it would help.
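        | 
        | For concreteness, a rough sketch of the frozen-encoder setup
        | (this uses a small T5 from HuggingFace purely for illustration;
        | Imagen itself uses T5-XXL and its own pipeline):
        | 
        |     import torch
        |     from transformers import T5Tokenizer, T5EncoderModel
        | 
        |     tok = T5Tokenizer.from_pretrained("t5-small")
        |     enc = T5EncoderModel.from_pretrained("t5-small")
        |     enc.requires_grad_(False)  # frozen: no fine-tuning
        |     enc.eval()
        | 
        |     batch = tok(["a corgi riding a skateboard"],
        |                 return_tensors="pt", padding=True)
        |     with torch.no_grad():
        |         emb = enc(**batch).last_hidden_state
        | 
        |     # emb: (1, seq_len, d_model) embeddings that the
        |     # diffusion U-Net cross-attends to. Fine-tuning would
        |     # instead keep requires_grad on and backprop the image
        |     # loss into the encoder.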
        
       | DonHopkins wrote:
       | Wait, this isn't about the line of intelligent xeroxographic
       | laser printers developed by Imagen Corporation in 1981,
       | supporting the Impress printer language?
       | 
       | https://tug.org/TUGboat/tb02-2/tb03imagen.pdf
       | 
       | https://www.openprinting.org/driver/imagen
        
         | SleekEagle wrote:
         | How do you think it prints the images!
        
       | coding123 wrote:
        | Is this by a person who knows, or one who is guessing?
        
         | watmough wrote:
         | The important part seems to be the diffusion model.
         | 
         | Explanation linked from same page:
         | https://www.assemblyai.com/blog/diffusion-models-for-machine...
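          | 
          | Roughly, the diffusion model is trained to predict the noise
          | that was added to an image; a toy sketch of that training
          | objective (not Imagen's actual code; the model() call here is
          | a stand-in for a text-conditioned U-Net):
          | 
          |     import torch
          |     import torch.nn.functional as F
          | 
          |     T = 1000
          |     betas = torch.linspace(1e-4, 0.02, T)
          |     alphas_bar = torch.cumprod(1 - betas, dim=0)
          | 
          |     def loss(model, x0, text_emb):
          |         # pick a random timestep per image
          |         t = torch.randint(0, T, (x0.shape[0],))
          |         a = alphas_bar[t].view(-1, 1, 1, 1)
          |         noise = torch.randn_like(x0)
          |         # forward process: noise the clean image
          |         xt = a.sqrt() * x0 + (1 - a).sqrt() * noise
          |         # net predicts the noise, given text
          |         return F.mse_loss(model(xt, t, text_emb), noise)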
        
         | nestorD wrote:
          | The paper is very well explained and, reading this post, they
          | mostly seem to make its content accessible to non-domain
          | experts.
        
         | tiborsaas wrote:
         | I guess he read the research paper.
        
         | thunderbird120 wrote:
         | Google published these implementation details
        
       | natch wrote:
       | > Imagen, released just last month, can generate high-quality,
       | high-resolution images given only a description of a scene
       | 
       | "Released"? What? Papers are published. Websites are published.
       | Tools are "released."
       | 
       | Where has Imagen been released?
        
         | bpiche wrote:
          | This implementation popped up on Hacker News not too long ago.
         | I got it working on Colab first, and then my own GPU at home.
         | But just barely. Need more memory :)
         | 
         | https://github.com/lucidrains/imagen-pytorch
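          | 
          | For anyone curious, usage looks roughly like this (from
          | memory of the project's README at the time, so parameter
          | names may have changed; this exercises an untrained toy
          | config, not the real model):
          | 
          |     import torch
          |     from imagen_pytorch import Unet, Imagen
          | 
          |     unet1 = Unet(dim=32, cond_dim=512,
          |                  dim_mults=(1, 2, 4, 8),
          |                  num_resnet_blocks=3,
          |                  layer_attns=(False, True, True, True),
          |                  layer_cross_attns=(False, True,
          |                                     True, True))
          | 
          |     unet2 = Unet(dim=32, cond_dim=512,
          |                  dim_mults=(1, 2, 4, 8),
          |                  num_resnet_blocks=(2, 4, 8, 8),
          |                  layer_attns=(False, False, False, True),
          |                  layer_cross_attns=(False, False,
          |                                     False, True))
          | 
          |     imagen = Imagen(unets=(unet1, unet2),
          |                     image_sizes=(64, 256),
          |                     timesteps=1000,
          |                     cond_drop_prob=0.1)
          | 
          |     # toy data just to exercise the loss
          |     images = torch.randn(4, 3, 256, 256)
          |     texts = ["a corgi riding a skateboard"] * 4
          | 
          |     for i in (1, 2):
          |         loss = imagen(images, texts=texts,
          |                       unet_number=i)
          |         loss.backward()
          | 
          |     # after (a lot of) real training:
          |     # samples = imagen.sample(texts=texts,
          |     #                         cond_scale=3.)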
        
           | Voloskaya wrote:
            | The value is in the data and the trained weights; the
            | implementation is not where the bottleneck is in terms of
            | reproducing those models.
            | 
            | Still, great work from the author, but we most definitely
            | cannot say that Imagen has been released.
        
           | stavros wrote:
           | Wait, so I can try this on Colab right now?
        
             | refulgentis wrote:
              | No. Something that's been causing a lot of confusion in AI
              | art is that people stand up quick implementations roughly
              | matching the description in the paper, but they're not
              | really investing in training them. Then people see
              | "imagen-pytorch" on GitHub and get confused, thinking it's
              | either Imagen itself or a suitable replica of it.
              | 
              | There are like 3 projects named DALL-E, and then the 2 real
              | DALL-Es... frustrating.
        
               | joshcryer wrote:
                | People are really thirsty to play with this tech; you
                | can't blame them. Just search for dataset creators on
                | Hugging Face. I'd link directly to several of them
                | running, but it would just overwhelm the creators. If you
                | want to be in early, you'll find them. The beautiful
                | thing is that open source is going to make this stuff
                | available for _everyone_, and in a very short timeframe.
                | It's crazy how fast it moves.
        
               | spullara wrote:
                | It is a suitable replica of it. It just isn't trained.
        
               | natch wrote:
               | But the training is the thing that would make it
               | suitable.
        
               | bpiche wrote:
               | I mean, you try training this thing without a warehouse
               | full of GPUs... to me, the algorithm is just as
               | interesting as the model. Perhaps more so.
        
           | echelon wrote:
           | Are there any large publicly available models, ready to fine
           | tune and deploy, that were trained on massive data sets?
           | 
           | I really want to build services with these.
        
       | alexccccc wrote:
       | Super interesting
        
       ___________________________________________________________________
       (page generated 2022-06-23 23:01 UTC)