[HN Gopher] No, DALL-E doesn't have a secret language
___________________________________________________________________
No, DALL-E doesn't have a secret language
Author : doener
Score : 111 points
Date : 2022-06-01 20:08 UTC (2 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| jawarner wrote:
| The tweet is in response to a preliminary paper [1] [2] studying
| text found in images generated by, e.g., "Two whales talking
| about food, with subtitles." DALL-E doesn't generate meaningful
| text strings in the images, but if you feed the gibberish text it
| produces -- "Wa ch zod ahaakes rea." -- back into the system as a
| prompt, you get semantically meaningful images, e.g., pictures of
| fish and shrimp.
|
| [1]
| https://giannisdaras.github.io/publications/Discovering_the_...
|
| [2] https://twitter.com/giannis_daras/status/1531693093040230402
| dekhn wrote:
| I think the tweeter is being a bit too pedantic. Personally, after
| seeing this paper I spent some time thinking about embeddings,
| manifolds, the structure of language, scientific naming, and what
| the decodings of points near the centers of clusters in embedding
| spaces look like (archetypes). I think building networks and
| asking them to explain themselves using their own capabilities is
| a wonderful idea that will turn out to be a fruitful area of
| research in its own right.
| austinjp wrote:
| > asking [neural networks] to explain themselves using their
| own capabilities
|
| Exactly. This could be profound. I'm looking forward to
| further work here. Sure, the examples here are daft, but
| developing this approach could be like understanding a
| talking lion [0], only this time it's a lion of our making.
|
| [0] https://tzal.org/understanding-the-lion-the-in-joke-of-
| psych...
| lotaezenwa wrote:
| I concur that the tweeter is being pedantic.
|
| This is largely some embedding of semantics that we currently
| do not fully have a mapping for, precisely because it was
| generated stochastically.
|
| Saying it was "not true" seems like clickbait.
| koboll wrote:
| Especially since his results confirm _most_ of what the
| original thread claimed. A couple of the inputs did not
| reliably replicate, but "for the most part, they're not
| true" seems straightforwardly false. He even seems to
| deliberately ignore this sometimes, such as when he says "I
| don't see any bugs" when there is very obviously a bug in
| the beak of all but two or three of the birds.
| mannykannot wrote:
| When I zoomed in, I felt only four in ten birds clearly
| had anything in their beaks, and in each case it looked
| like vegetable matter. In the original set, only one
| clearly has an insect in its beak.
|
| Are there higher-resolution images to be had?
| ASalazarMX wrote:
| Lower in the same thread he accepts that his main tweet
| was clickbaity, and that actually there's consistency in
| some of the results.
| Jweb_Guru wrote:
| Not really; he says afterwards that he was mostly trying to
| inject some humility. He really doesn't think this is
| measuring anything of interest. For the birds result in
| particular, see https://twitter.com/BarneyFlames/status/1
| 531736708903051265.
| ASalazarMX wrote:
| If DALL-E had the option to output "Command not understood",
| maybe we wouldn't be discussing this.
|
| Like those AIs that guess what you draw and recognize random
| doodling as "clouds" [1], DALL-E is probably taking the least
| unlikely route. A gibberish word may be drawn as a bird simply
| because the best guesses were "bird (2%), goat (1%), radish
| (1%)".
|
| 1. https://quickdraw.withgoogle.com
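|
| Roughly the behavior I mean, as a toy numpy sketch (illustrative
| only, nothing to do with DALL-E's actual code): a model with no
| "not understood" option always returns its least-unlikely class,
| even when every class is unlikely.
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     labels = ["bird", "goat", "radish", "cloud", "shrimp"]
|     logits = rng.normal(size=len(labels)) * 0.1   # near-flat scores
|                                                   # for gibberish input
|     probs = np.exp(logits) / np.exp(logits).sum() # softmax
|     print(labels[int(np.argmax(probs))],          # picks a label anyway,
|           f"{probs.max():.0%}")                   # with only ~20% confidence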
| wruza wrote:
| Are this and the previous tweet a discussion between ML guys? My
| layman understanding of neural networks is that the core operation
| is that you basically kick a figure down a hill and see where it
| ends up, but both the figure and the hill are N-dimensional
| objects, where N is too huge to comprehend. Of course some
| nonsensical figures end up at valid locations, but can you really
| expect a stable inner structure to come out of the hill-figure
| interaction? I think it's unlikely that there is a place in the
| learning method to produce one. NNs can give interesting results,
| but they don't magically rewrite their own design yet.
|
| It would still be interesting to see how the output changes with
| small changes to these inputs. If my vague understanding is at
| all close, this would reveal which "faces" are "noisier" than the
| others. Not sure what that would give us, though.
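|
| (For what it's worth, the "kick it down the hill" picture in code
| is just gradient descent -- a toy two-dimensional version below;
| real models do the same thing over millions of dimensions.)
|
|     import numpy as np
|
|     def loss(w):                       # the "hill"
|         return (w[0] - 3) ** 2 + (w[1] + 1) ** 2
|
|     def grad(w):                       # slope of the hill at w
|         return np.array([2 * (w[0] - 3), 2 * (w[1] + 1)])
|
|     w = np.array([10.0, 10.0])         # the "figure" we kick
|     for _ in range(200):
|         w = w - 0.1 * grad(w)          # roll a little way downhill
|     print(w, loss(w))                  # settles near [3, -1], loss ~ 0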
| quink wrote:
| If it smells like overfitting, it probably is overfitting.
| danamit wrote:
| I no longer believe AI claims when I read them, even at the risk
| of becoming a cynic and a disbeliever in the field. And I am not
| sure whether that's bad for society or just bad for me.
|
| I am more likely to believe celebrity gossip than AI news
| articles.
| aaron695 wrote:
| peanut_worm wrote:
| I think this guy is being a bit pedantic. It is returning semi-
| consistent results for gibberish, which is interesting. That's
| all the original poster meant.
| KaoruAoiShiho wrote:
| Am I dumb, or does that thread prove that DALL-E does in fact
| have a secret language, just not with exactly the meaning
| described in the paper?
| deltaonefour wrote:
| There's some form of language here... the correlations are
| evidence enough. The grammar, I believe, is complex and likely
| not a human grammar, so certain words, when paired with other
| words, can negate the meaning of a word altogether or even
| completely change it.
|
| For example, "hedge" combined with "hog" is neither a "hedge"
| nor a "hog", nor is it some sort of horrific hybrid of hedges
| and hogs. A hedgehog is a small animal. Most likely this is
| what's going on here.
|
| The domain is almost infinite, and the range is even greater.
| Thus it's actually realistic to say that there must be hundreds
| of input and output sets that form alternative languages.
| Jweb_Guru wrote:
| I don't think there is any evidence of a language here unless
| you stretch the definition to the point of absurdity. It won't
| even reliably reproduce the same kinds of images that contained
| the text it output in the first place, which was the original
| premise of the claim. Obviously, probing some overconstrained
| high-dimensional space where the model is never rewarded for
| uncertainty has to produce _something_; that doesn't mean that
| something is a language.
| [deleted]
| redredrobot wrote:
| So his argument is that the text clearly maps to concepts in the
| latent space, but when composing them the results are unexpected,
| so it isn't language? Why isn't this better described as 'the
| rules of composition are unknown'?
| rcoveson wrote:
| That framing is worse because it hides an assumed conclusion,
| i.e. that there are rules of composition.
| redredrobot wrote:
| But don't we already know that composition exists in DALL-E?
| Don't the points shown in the tweet indicate that some form
| of composition exists? The 3D renders are clearly render-
| like, and the paintings and cartoons are clearly in the
| appropriate style.
| rcoveson wrote:
| "That there exist rules of composition of the hypothesized
| secret DALL-E language" is a much stronger claim than that
| it "understands" composition of text in the real languages
| it was trained on.
|
| Though I'll also point out that even evidence for that
| weaker claim is tenuous. It definitely knows how to move an
| image closer to "3D render" in concept-space, but it
| doesn't seem to understand the linguistic composition of
| your request. For example, you'd have an extremely hard
| time getting it to generate an image of a person using 3D
| rendering software, or a "person in any style that isn't 3D
| render"; it would probably just make 3D renders of persons.
|
| I haven't played around with it myself, I'm going off the
| experiences of others. For example:
|
| https://astralcodexten.substack.com/p/a-guide-to-asking-
| robo...
| joshmarlow wrote:
| I found this analysis interesting:
| https://twitter.com/Plinz/status/1531711345585860609?t=Yinol...
| SilverBirch wrote:
| This just feels like one of those topics where you'd really want
| a linguist: someone who really understands the construction and
| evolution of language, to point out some of the underlying
| _reasons_ why language is constructed the way it is. Because I
| guess that's partly what DALL-E is: it's trying to approximate
| that, and the interesting thing would be where it differs from
| real language rather than where it matches it. If I give it a
| made-up word that looks like a Latin phrase for a species of
| bird, then it behaving as if I'd given it a Latin phrase for a
| species of bird is pretty reasonable. If you said "Homo
| heidelbergensis" to me, I wouldn't _know_ that it was a species
| of prehistoric human, but I would feel pretty comfortable making
| that kind of leap.
|
| I also think you could probably hire a team of linguists pretty
| cheaply compared to a team of AI engineers.
| masswerk wrote:
| I don't think that this is related to language at all. First,
| let's ask: is there a way for DALL-E to refuse an output (as
| in, "this makes no sense")? Then, what would we expect the
| output for gibberish to look like? Isn't it still subject to
| filtering for the best "clarity" and the strongest signals?
| While I don't think these are collisions in the traditional
| sense of a hash collision, any input must produce a signal, as
| there is no null path, and what we see is sort of a result of
| "collisions" with "legitimate" paths. Still, this may tell us
| something about the inner structure.
|
| Also, there is no way for vocabulary to exist on its own
| without grammar, as these are two sides of the phenomenon we
| call language. Some signs of grammar would have had to emerge
| together with it, at once. However...
|
| ----
|
| Edit: Let's imagine a typical movie scene. Our nondescript
| individual points at himself and utters "Atuk" (yes, Ringo
| Starr!) and then points at his counterpart in this
| conversation, who utters "Carl Benjamin von Richterslohe". This
| involves quite an elaborate system of grammar, where we already
| know that we're asking for a designator, that this is not the
| designator for the act of pointing, and that by decidedly
| pointing at a specific object we're asking for a specific
| designator, not a general one. Then C.B. von Richterslohe, our
| fearless explorer, waves his hand over the backdrop of the
| jungle, asking about "blittiri" in an attempt to verify that it
| means "bird", whereupon Atuk readily points out a monkey. --
| While only nouns have been exchanged, there's a ton of grammar
| in this.
|
| And we haven't even arrived at things like "a monkey sitting
| at the foot of a tree", which is mostly about the horizontal
| and vertical axes of grammar, along which we align things and
| where we can substitute one thing for another in a specific
| position -- which is ultimately what provides them with meaning
| (by which combinations and substitutions are legitimate and
| which are not).
|
| Now, in light of this, the fact that specific compounds change
| their alleged "meaning" radically when aligned doesn't allow
| for high hopes of this being a language.
| runj__ wrote:
| I was thinking the other day about a system for pulling data
| from verbal nonsense -- speaking in tongues or something
| similar. I can create a bunch of noises that lack obvious
| meaning for me, but they must carry some meaning that can be
| learned, since humans are terrible at being truly random
| (lol XD).
|
| I wonder to what extent I would be able to share ideas I lack
| the words for; my perceived bitrate when creating "random"
| noise is certainly higher than when verbally communicating an
| idea to another human. Will we even share a common language in
| the future? Or will we each have our own language that is
| translated for other people?
| masswerk wrote:
| Well, I can only answer with kind of a pun. With
| Wittgenstein, language is a constant conversation about the
| extent of the world, about what is and what is not. As
| such, it is necessarily shared. In the _Tractatus_ we find,
|
| > 5.62 (...) For what the solipsist means is quite correct;
| only it cannot be _said,_ but makes itself manifest. The
| world is my world: this is manifest in the fact that the
| limits of _language_ (of that language which alone I
| understand) mean the limits of my world. [1]
|
| So, something could become _apparent,_ but you would still
| not have _said_ anything (as it's not part of that
| conversation). ;-)
|
| [1] https://www.masswerk.at/digital-
| library/catalog/wittgenstein...
|
| (I deem this edition to be somewhat appropriate in
| context.)
| belugacat wrote:
| Given that DALL-E is a giant matrix multiplication that
| correlates fuzzy concepts in text with fuzzy concepts in images,
| wouldn't one expect that there will be hotspots of nonsensical
| (to us) correlations, e.g. between "apoploe vesrreaitais" and
| "bird"? Intuitively it feels like an aspect of the no-free-lunch
| theorem.
| axg11 wrote:
| Exactly this. At a high level, DALL-E is mapping text to a
| (continuous) matrix and then mapping that matrix to an image
| (another matrix). All text inputs will map to _something_.
| DALL-E doesn't care whether that mapping makes sense; it has
| been trained to produce high-quality outputs, not to ensure the
| validity of mappings.
|
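| A toy version of that pipeline, with random matrices standing in
| for the learned ones (purely illustrative; every input maps to
| _some_ latent point and therefore to _some_ image):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     W_text = rng.normal(size=(256, 64))      # stand-in "text encoder"
|     W_img = rng.normal(size=(64, 32 * 32))   # stand-in "image decoder"
|
|     def render(prompt: str) -> np.ndarray:
|         # crude bag-of-bytes featurization of the prompt
|         feats = np.bincount(np.frombuffer(prompt.encode(), np.uint8),
|                             minlength=256).astype(float)
|         latent = feats @ W_text                  # text -> latent
|         return (latent @ W_img).reshape(32, 32)  # latent -> "image"
|
|     # gibberish lands on a perfectly valid point, same as real text
|     print(render("a bird eating an insect").shape)   # (32, 32)
|     print(render("apoploe vesrreaitais").shape)      # (32, 32)
|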
| None of this makes DALL-E any less impressive to me. High
| quality image generation is a truly amazing result. Results
| from foundational models (GPT-3, PaLM, DALL-E, etc) are so
| impressive that they're forcing us to reconsider the nature of
| intelligence and raise the bar. That's a sign of a job well
| done to me.
| LoveMortuus wrote:
| But if it's just mapping text to image then it would be fair
| to assume that using the same text would result in the same
| image. But does that actually happen?
| Jweb_Guru wrote:
| No, it does not. It also doesn't always generate the same
| category of image. See https://twitter.com/realmeatyhuman/s
| tatus/153173861680386457....
|
| As much as people would like there to be, there really does
| not seem to be anything here. The original author doesn't
| think so either (I'd need to find the tweet again).
| tbalsam wrote:
| I have seen many bad abuses of the NFL theorem's name.
|
| This is by far the worst.
| smeagull wrote:
| Yeah. The problem here is that the network only has room for
| concepts, and hasn't been trained to recognize meaningless crap.
| Nor does it really have any way to respond with "This isn't a
| sentence I know"; it just has to come up with an image that
| best matches whatever prompt it has been fed.
| skybrian wrote:
| "Secret language" is clickbait, but it seems like systematically
| exploring how it responds to gibberish might find something
| interesting?
|
| Also, I'm wondering if there is some way that these models could
| have a decent error response rather than responding to every
| input?
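|
| One cheap way to get an error response would be to gate the prompt
| before it ever reaches the image model, e.g. by scoring it with
| some language model and refusing anything too unlikely. A toy
| character-bigram version (a real system would use a much better
| model, and the -8 threshold here is arbitrary):
|
|     import math
|     from collections import Counter
|
|     corpus = ("two whales talking about food with subtitles "
|               "a photo of a bird eating an insect ") * 100
|     pairs = Counter(zip(corpus, corpus[1:]))
|     total = sum(pairs.values())
|
|     def avg_logprob(prompt):
|         ps = [pairs.get(p, 0) / total for p in zip(prompt, prompt[1:])]
|         return sum(math.log(p + 1e-9) for p in ps) / max(len(ps), 1)
|
|     for prompt in ["a bird eating food", "apoploe vesrreaitais"]:
|         ok = avg_logprob(prompt.lower()) > -8
|         print(prompt, "->", "ok" if ok else "not understood")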
| dang wrote:
| Recent and related:
|
| _DALL-E 2 has a secret language_ -
| https://news.ycombinator.com/item?id=31573282 - May 2022 (109
| comments)
| deltaonefour wrote:
| It's OBVIOUS what's going on here. When you combine TWO different
| languages you get stuff that appears as NONSENSE. You have to
| stay in the same language!
|
| There is for sure a set of consistent words that produce output
| that makes sense to us. He just picked the wrong set!
| sydthrowaway wrote:
| Dumb question, but how are DALL-E's (and any other generative AI
| algorithm's) results so... smooth?
|
| For example, I could write a heuristic algorithm to produce the
| same thing using a Google image search, but it would look like MS
| Word clip art.
| Enginerrrd wrote:
| Lots of denoising steps after the initial attempt at forming a
| connection to the prompt is made?
| sillysaurusx wrote:
| This is one of my favorite topics in all of AI. It was the most
| surprising and mysterious discovery for me.
|
| The answer is that the training process literally has to make
| the results smooth. That's how training works.
|
| Imagine you have 100 photos. Your job is to arrange them by
| color. You can place them however you want, but similar colors
| should be physically closer together.
|
| You can imagine the result would look a lot like a photoshop
| RGB picker, which is smooth.
|
| The surprise is, this works for any kind of input. Even text
| paired with images.
|
| The key is the loss function (a horrible name). In the color
| picker example, the loss function would be how similar two
| colors are. In the text to image example, it's how _dissimilar_
| the input examples are from each other (Contrastive Loss). The
| brilliance of that is, pushing dissimilar pairs apart is the
| same thing as pulling similar pairs together, when you train
| for a long time on millions of examples. Electrons are all
| trying to push each other apart, but your body is still smooth.
|
| The reason it's brilliant is that it's far easier to measure
| dissimilar pairs than to come up with a good way of judging
| "does this text describe this image?" -- you definitely know
| that it isn't a bicycle, but you might not know whether a car
| is a Corvette or a Tesla. But both the Corvette and the Tesla
| will be pushed away from text that says it's a bicycle, and
| toward text that says it's a car.
|
| That means for a well-trained model, the input _by definition_
| is smooth with respect to the output, the same way that a small
| change in {latitude, longitude} in real life corresponds to a
| small change in the culture of a given region of the world.
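|
| A toy numpy sketch of that contrastive idea (the CLIP-style loss
| in miniature, not OpenAI's actual code):
|
|     import numpy as np
|
|     def contrastive_loss(text_emb, img_emb, temperature=0.07):
|         # rows are paired: text_emb[i] describes img_emb[i]
|         t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
|         v = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
|         logits = t @ v.T / temperature   # similarity of every pair
|         # cross-entropy: each text's true image should beat the rest
|         exp_sum = np.exp(logits).sum(axis=1, keepdims=True)
|         log_softmax = logits - np.log(exp_sum)
|         return -np.mean(np.diag(log_softmax))
|
|     rng = np.random.default_rng(0)
|     print(contrastive_loss(rng.normal(size=(4, 8)),
|                            rng.normal(size=(4, 8))))
|
| Minimizing this pulls matched (text, image) pairs together and
| pushes every mismatched pair apart, which is where the smoothness
| comes from.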
| Michelangelo11 wrote:
| Do you by any chance have a link to a paper or article that
| explains this in detail? I'd love to understand it better.
| jeabays wrote:
| sillysaurusx wrote:
| It doesn't exist. The above explanation is the result of me
| spending almost all of my time immersing myself in ML for
| the last three years.
|
| gwern helped too. He has an intuition for ML that I'm still
| jealous of.
|
| Your best bet is to just start building things and worry
| about explanations later. It's not far from the truth to
| say that even the most detailed explanation is still a
| longform way of saying "we don't really know." Some people
| get upset and refuse to believe that fundamental truth, but
| I've always been along for the ride more than the
| destination.
|
| It's never been easier to dive in. I've always wanted to
| write detailed guides on how to start, and how to navigate
| the AI space, but somehow I wound up writing an ML fanfic
| instead: https://blog.gpt4.org/jaxtpu
|
| (Fun fact: my blog runs on a TPU.)
|
| I'm increasingly of the belief that all you need is a
| strong desire to create things, and some resources to play
| with. If you have both of those, it's just a matter of time
| -- especially putting in the time.
|
| That link explains how to get the resources. But I can't
| help with how to get a desire to create things with ML.
| Mine was just a fascination with how strange computers can
| be when you wire them up with a small dose of calculus that
| I didn't bother trying to understand until two years after
| I started.
|
| (If you mean contrastive loss specifically,
| https://openai.com/blog/clip/ is decent. But it's just a
| droplet in the pond of all the wonderful things there are
| to learn about ML.)
| Michelangelo11 wrote:
| Thanks! Really appreciate the response.
| snovv_crash wrote:
| IMO the term "cost function" is much more intuitive than
| "loss function" -- it tells you the cost, which the process
| attempts to minimize by some iterative procedure (in this case
| training).
| hooande wrote:
| this is a very intuitive analysis. well done and thanks
| sizzle wrote:
| thanks for sharing your hard-fought knowledge with us curious
| bystanders
| NickNaraghi wrote:
| Fantastic, this helped me a lot! Thanks for taking the time
| to write this out.
| deltaonefour wrote:
| I actually completely lost interest once I found this out.
| Simply taking some ML course, like the old Andrew Ng courses
| online, is enough for you to get the general idea.
|
| ML is simply curve fitting. It's an applied math problem
| that's quite common. In fact I lost a lot of interest in
| intelligence in general once I realized this was all that was
| going on. The implication is that all of intelligence is
| really some form of curve fitting.
|
| The simplest form of this is linear regression, which is used
| to derive an equation for a line from a set of 2D points. All
| of ML is basically a 10,000-dimensional (or much higher)
| extension of that. The magic is lost.
|
| Most ML research is just about finding the most efficient way
| to find the best-fitting curve given the fewest data points.
| An ML guy's knowledge is centered around a bunch of tricks and
| techniques to achieve that goal with some N-D template
| equation. And the general template equation is always the
| same: a neural network. The answer to what intelligence is
| seems to be quite simple and not that profound at all... which
| makes sense given that we're able to create things like DALL-E
| in such a short time frame.
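|
| The 2D case, for concreteness (plain numpy; deep nets swap the
| straight line for a huge parametric function and the closed-form
| fit for gradient descent, but it's the same "minimize the error"
| exercise):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     x = np.linspace(0, 10, 50)
|     y = 3 * x + 2 + rng.normal(scale=1.0, size=x.size)  # noisy line
|
|     slope, intercept = np.polyfit(x, y, deg=1)  # least-squares fit
|     print(slope, intercept)                     # close to 3 and 2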
| sillysaurusx wrote:
| It's the other way around. ML is cool precisely because
| it's a guitar for the mind, not a mind itself.
| https://soundcloud.com/theshawwn/sets/ai-generated-
| videogame...
|
| I made that by using ML as a guitar. I chose instruments
| and style the way a guitarist's fingers choose frets.
|
| And saying "give me this style with these instruments" is
| far easier than recording it yourself.
|
| For what it's worth, I agree with you about AGI. https://tw
| itter.com/theshawwn/status/1446076902607888385?s=2...
|
| But for me, that means it's far more interesting than AGI.
| Everyone has their eye on AGI, and no one seems to be
| taking ML at face value. That means the first companies to
| do it will stand to make a fortune.
| deltaonefour wrote:
| Why do people use analogies to prove a point? It doesn't
| prove anything.
|
| What was your point here? ML is like a guitar? What you
| said doesn't seem to contradict anything I said, other
| than that you find curve fitting interesting and I don't.
|
| Not trying to be offensive here, don't take it the wrong
| way.
| fnordpiglet wrote:
| Boring. Give me emergent sentient AI, fact or fiction!
___________________________________________________________________
(page generated 2022-06-01 23:00 UTC)