[HN Gopher] DALL-E 2 has a secret language
___________________________________________________________________
DALL-E 2 has a secret language
Author : smarx
Score : 371 points
Date : 2022-05-31 18:46 UTC (4 hours ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| kazinator wrote:
| That's reminiscent of small children making up their own words
| for things. Those words are stable in that you can converse with
| the child using those words.
| notimpotent wrote:
| My first thought upon reading this: what if DALL-E (or a similar
| AI) uncovers some kind of hidden universal language that is
| somehow more "optimal" than any existing language?
|
| i.e. anything can be completely described in a more succinct
| manner than any current spoken language.
|
| Or maybe some kind of universal language that occurs naturally
| and that any semi-intelligent life can understand.
|
| Fun stuff!
| extr wrote:
| This is kind of already what's happening inside the NN. You can
| think of intermediate layers in the network as talking to each
| other in "NN-ese", that is, translating from one form of
| representation (encoding) to another. At the final encoder
| layer, the input is maximally compressed (for that given
| dataset/model architecture/training regime). The picture
| (millions of pixels) of the dog is reduced to a few bits of
| information about what kind of dog it is and how it's posed,
| what color the background is, etc.
|
| However, optimality of encoding is entirely relative to the
| decoding scheme used and your purposes. Obviously a matrix of
| numbers representing a summary of a paragraph can be in some
| sense "more compressed" than the English equivalent, but it's
| useless if you don't speak matrices. Similarly, you could
| invent an encoding scheme with Latin characters that is more
| compressed than English, but it's again useless if you don't
| know it or want to take the time to learn it. If we wanted we
| could make English more regular and easier to learn/compress,
| but we don't, for a whole bunch of practical/real life reasons.
| There's no free lunch in information theory. You always have to
| keep the decoder/reader in mind.
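| To make the "useless without the decoder" point concrete, here
| is a toy sketch (my own illustration, nothing to do with
| DALL-E's internals): a shared codebook shrinks English text
| dramatically, but the output is gibberish to anyone who doesn't
| hold the codebook. The words and codes are made up.
|
|     # Toy codebook "compression": much shorter than English,
|     # useless without the shared decoder.
|     codebook = {"the": "q", "quick": "w", "brown": "e",
|                 "fox": "r", "jumps": "t", "over": "y",
|                 "lazy": "u", "dog": "i"}
|     decode_book = {v: k for k, v in codebook.items()}
|
|     def encode(text):
|         return "".join(codebook[w] for w in text.split())
|
|     def decode(code):
|         return " ".join(decode_book[c] for c in code)
|
|     msg = "the quick brown fox jumps over the lazy dog"
|     packed = encode(msg)          # -> "qwertyqui"
|     print(len(msg), len(packed))  # 43 9
|     assert decode(packed) == msg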
| astrange wrote:
| That's not possible - it's like asking for a compression system
| that can compress any message. By the pigeonhole principle, any
| lossless scheme that shortens some messages must lengthen
| others.
|
| All human languages are about the same efficiency when spoken,
| but of course this mainly depends on having short enough words
| for the most common concepts in the specific thing you're
| talking about.
|
| https://www.science.org/content/article/human-speech-may-hav...
|
| And there can't be a universal language because the symbols
| (words) used are completely arbitrary even if the grammar has
| universal concepts.
| elil17 wrote:
| There are a couple of sci-fi short stories in the book "Stories
| of Your Life and Others" by Ted Chiang which explore the idea
| that
| highly advanced intelligences might create special languages
| which accommodate special thoughts which we cannot easily
| think.
| jcims wrote:
| I think something like this is actually quite likely.
|
| I've been wondering if there is a way to do psychological
| experiments on these large language models that we couldn't do
| with a person.
| julianbuse wrote:
| I imagine these would be very interesting, but perhaps not very
| applicable to humans (which I presume would be the point).
| OTOH, since these language models are trained on human language
| and media, the results might still have some value. I'm quite
| split on which I think is more likely (I don't have any
| experience in AI/ML nor in psychology, so what do I know).
| sbierwagen wrote:
| Ithkuil (Ithkuil: Iţkuîl) is an experimental constructed
| language created by John Quijada.[1] It is designed to express
| more profound levels of human cognition briefly yet overtly and
| clearly, particularly about human categorization.
|
| Meaningful phrases or sentences can usually be expressed in
| Ithkuil with fewer linguistic units than natural languages.[2]
| For example, the two-word Ithkuil sentence "Tram-mļöi
| hhâsmařpţuktôx" can be translated into English as "On the
| contrary, I think it may turn out that this rugged mountain
| range trails off at some point."[2]
|
| https://en.wikipedia.org/wiki/Ithkuil
| jws wrote:
| In short: DALL-E 2 generates apparent gibberish for text in some
| circumstances, but feeding the gibberish back in gets recognized
| and you can tease out the meaning of words in this unknown
| language.
| carabiner wrote:
| Science has gone too far.
| astrange wrote:
| It seems obvious this would happen (it's just adversarial inputs
| again) - they didn't make DALL-E reject "nonsense" prompts, so it
| doesn't try to, and indeed there's no reason you'd want to make
| it do that.
|
| Seems like a useful enhancement would be to invert the text and
| image prior stages, so it'd be able to explain what it thinks
| your prompt meant along with making images of it.
| [deleted]
| schroeding wrote:
| Interesting! I wonder if the model would "understand" the made-up
| names from today's stained glass window post[1] like "Oila Whamm"
| for William Ockham and output similar images.
|
| [1] https://astralcodexten.substack.com/p/a-guide-to-asking-
| robo...
| layer8 wrote:
| Sounds like an effect similar to illegal opcodes:
| https://en.m.wikipedia.org/wiki/Illegal_opcode
| wongarsu wrote:
| Link to the 5 page paper, for those that don't like twitter
| threads:
|
| https://giannisdaras.github.io/publications/Discovering_the_...
| TOMDM wrote:
| Shouldn't this be expected to a certain extent?
|
| Gibberish has to map _somewhere_ in the model's concept space.
|
| Whether it maps onto anything we'd recognise as consistent is
| another question. As other people have noted, the gibberish
| breaks down when you move it into another context, but who's to
| say that DALL-E 2 isn't staying consistent with some concept it
| understands that isn't immediately recognisable to us?
|
| The interesting part is if you can trick it to spit out gibberish
| in targeted areas of that concept space using crafted queries.
| gwern wrote:
| > Shouldn't this be expected to a certain extent?
|
| Not really. It's a stochastic model, so after a bunch of random
| denoising steps, it could easily just be mapping every bit of
| gibberish to a random image, and it would be vanishingly
| unlikely for any of them to be similar or for the relationship
| to run in reverse.
| codeflo wrote:
| I mean, everything is easy to predict in retrospect. :)
| Personally, I'm a bit surprised that it has learned any
| connection between the letters in the generated image and the
| prompt text at all. I had assumed (somewhat wrongly, it seems)
| that the gibberish means that the generator just thinks of text
| as a "pretty pattern" that it fills in without meaning. For
| example, a recent post on HN suggested that it likes the word
| "Bay", simply because that appears so often on maps.
| momojo wrote:
| > Shouldn't this be expected to a certain extent?
|
| In hindsight, sure. Given enough time someone might have
| predicted the phenomenon. But I don't think most of us did.
|
| What's more fascinating to me is how often this has happened in
| this space in just the last few years.
|
| 1. Some phenomenon is discovered
|
| 2. I'm surprised
|
| 3. It makes sense in hindsight
| jerf wrote:
| Expected after the fact, somewhat. Beforehand it would not be
| unreasonable to expect that the output text and the input text
| aren't necessarily connected in that way, though, especially
| as, as I understand it, DALL-E was not given input labelling
| explaining the text in various images. To it, text is just a
| frequently-recurring set of shapes that relate to each other a
| lot. This may yet be a false positive, based on other
| discussion.
|
| That the model would have a consistent form of _some_ kind of
| gibberish would be a given. Even humans have it:
| https://en.wikipedia.org/wiki/Bouba/kiki_effect And I'm sure if
| you asked native English speakers, "Hey, we know this isn't a
| word, but if it _was_ a word, what would it be? 'Apoploe
| vesrreaitars'" you would get something very far from a
| uniformly random distribution of all nameable concepts.
| EvgeniyZh wrote:
| You could expect gibberish to be distributed uniformly in
| latent space, disconnected from its linguistic counterpart --
| after all, these are textual inputs the model has never seen,
| and it can't even properly map words it has seen many times to
| their written form in images: the word "seafood" and an image
| of seafood are in the same place in latent space, but the word
| "seafood" rendered inside an image isn't. Yet some gibberish
| word rendered in an image is, and so is the same gibberish word
| as a prompt. That's very counterintuitive to me.
| TOMDM wrote:
| A uniform distribution makes sense for gibberish, not
| something I'd considered.
|
| A counterpoint I'd raise: I wonder how aggressive DALL-E 2 is
| in making assumptions about words it hasn't seen before.
|
| Hard to test, given that it's read essentially the entire
| internet; however, someone could make up some Latin-esque words
| whose meaning people would be able to guess.
|
| If the model is as good as people at guessing the meaning of
| such made-up words, it could stand to reason that, if it were
| aggressive enough about this, it might be doing the same thing
| with gibberish and thus ending up with its own interpretation
| of the word, which would land it back in a more targeted
| concept space.
|
| I'd love to see someone craft some words that most people could
| guess the meaning of, and see how DALL-E 2 fares.
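| One cheap way to probe something like this without DALL-E
| access is to embed candidate strings with the public CLIP model
| (github.com/openai/CLIP) and see which real words they land
| near. This is only a proxy -- DALL-E 2's own text encoder is a
| CLIP variant that isn't publicly released -- and the made-up
| string below is my own invention. A sketch:
|
|     import torch
|     import clip  # pip install git+https://github.com/openai/CLIP
|
|     model, _ = clip.load("ViT-B/32", device="cpu")
|     texts = ["bird", "insect", "vegetable",
|              "Apoploe vesrreaitais",  # from the paper
|              "volucris aviatus"]      # invented Latin-esque words
|     with torch.no_grad():
|         emb = model.encode_text(clip.tokenize(texts))
|     emb = emb / emb.norm(dim=-1, keepdim=True)
|     print(emb @ emb.T)  # pairwise cosine similarities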
| jamal-kumar wrote:
| This is really interesting because I was just looking at
| gibberish detection using GPT models. Mitigating AI with AI
| doesn't sound all that secure, since you can probably mess with
| the gibberish detection in a similar way -- or maybe the
| 'secret language', as they're calling it here, passes GPT
| gibberish detection? [1]
|
| [1] https://arr.am/2020/07/25/gpt-3-uncertainty-prompts/
| [deleted]
| GamerUncle wrote:
| https://nitter.net/giannis_daras/status/1531693093040230402
| 726D7266 wrote:
| Possibly related: in 2017, AI bots developed a shorthand that
| allowed them to communicate faster:
| https://www.facebook.com/dhruv.batra.dbatra/posts/1943791229...
|
| > While the idea of AI agents inventing their own language may
| sound alarming/unexpected to people outside the field, it is a
| well-established sub-field of AI, with publications dating back
| decades.
|
| > Simply put, agents in environments attempting to solve a task
| will often find unintuitive ways to maximize reward.
| joshstrange wrote:
| Which, to a lesser extent, isn't terribly different from humans
| if you think about it. We don't use a full new language, but
| every profession has its own jargon. Some of it spans the whole
| industry and some is company-specific.
| gibolt wrote:
| Unintuitive to biased humans. The solutions may actually be
| super intuitive/efficient, and we just can't wrap our heads
| around it yet
| neopallium wrote:
| Would it be possible to build a rosetta stone for this secret
| language with prompts asking for labeled pictures of different
| categories of objects? Or prompts about teaching kids different
| words?
| MaxBorsch228 wrote:
| What if you give it the same prompt but "with subtitles in
| French", for example?
| [deleted]
| jsnell wrote:
| One of the replies is a thread with a fairly convincing rebuttal,
| with examples:
|
| https://twitter.com/Thomas_Woodside/status/15317102510150819...
| dwallin wrote:
| I'm not sure it's a convincing rebuttal; the examples shown all
| seem to have some visible commonality.
|
| E.g. "Apoploe vesrreaitais" could refer to something along the
| lines of "fan / wedge" or "wing-like".
|
| If you look at the examples of cheese compared to the "birds
| and cheese" ones, the cheese tends to be laid out in a fan-like
| pattern and shaped into sharp-angled wedges.
| sudosysgen wrote:
| It seems to refer to "bird plant", meaning birds on trees, so
| it would make sense for there to be cheese and plants if it
| can't figure out how to fit a bird.
| joshcryer wrote:
| Yeah, and his example about bugs in the kitchen. Everything is
| edible and 'wild' or 'heirloom', and "contarra ccetnxniams
| luryca tanniounons" comes from the farmers talking about ...
| vegetables. So there's a definite interrelationship between the
| 'words' and the images.
|
| I'm unconvinced by the rebuttal as well. Not to say I'm
| convinced we have a fully formal language going on here, but
| there are definitely some shared concepts with the generated
| text.
|
| I wonder what Imagen would come up with, or whether its
| 'language' is more correlated with real language.
| ericb wrote:
| > Apoploe vesrreaitais" Could refer to something along the
| lines of a "fan / wedge"
|
| "feathered" maybe?
| f38zf5vdt wrote:
| I'm curious what it generates when given randomly generated
| strings of seemingly pronounceable words like "Fedlope
| Dipeioreitcus".
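| A quick way to mass-produce that kind of pronounceable nonsense
| to try (a throwaway scheme of my own, alternating simple
| consonant-vowel syllables; the alphabets are arbitrary):
|
|     import random
|
|     CONSONANTS = "bcdfglmnprstv"
|     VOWELS = "aeiou"
|
|     def fake_word(syllables=3):
|         # e.g. "fe" + "da" + "lo" -> "fedalo"
|         return "".join(random.choice(CONSONANTS) +
|                        random.choice(VOWELS)
|                        for _ in range(syllables))
|
|     # two nonsense "words" to feed to the model
|     print(" ".join(fake_word() for _ in range(2)))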
| jimhi wrote:
| We don't know the rules or grammar of this "language". Maybe
| nouns change based on how they are used
|
| https://en.wikipedia.org/wiki/Declension
| lmc wrote:
| A rebuttal to the rebuttal (without examples)...
|
| How many French people speak Breton?
| goodside wrote:
| My first reaction to this was, "It probably has to do with
| tokenization. If there's a 'language' buried in here, its native
| alphabet is GPT-3 tokens, and the text we see is a concatenation
| of how it thinks those tokens map to Unicode text."
|
| Most randomly concatenated pairs of tokens simply do not occur in
| any training text, because their translation to Unicode doesn't
| correspond to any real word. There are also combinations that do
| correspond to real words ("pres" + "ident" + "ial") but still
| never occur in training because some other tokenization is
| preferred to represent the same string ("president" + "ial").
|
| Maybe DALL-E 2 is assigning some sort of isolated (as in, no
| bound morphemes) meaning to tokens -- e.g., combinations of
| letters that are statistically likely to mean "bird" in some
| language when more letters are revealed. When a group of such
| tokens are combined, you get a word that's more "birdlike" than
| the word "bird" could ever be, because it's composed exclusively
| of tokens that mean "bird": tokens that, unlike "bird" itself,
| never describe non-birds (e.g., a Pontiac Firebird). The exact
| tokens it uses to achieve this aren't directly accessible to us,
| because all we get is poorly rendered roman text.
|
| I'm maybe not the ideal person to be speculating about this, but
| it bothers me that the word "token" isn't even mentioned in the
| article reporting this discovery (https://giannisdaras.github.io/
| publications/Discovering_the_...).
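| One way to see the effect being described: run the strings
| through a BPE tokenizer and look at the pieces. A sketch using
| GPT-2's tokenizer as an accessible stand-in (DALL-E 2's text
| encoder uses CLIP's BPE vocabulary, which differs in detail but
| behaves the same way on rare strings):
|
|     from transformers import GPT2Tokenizer
|
|     tok = GPT2Tokenizer.from_pretrained("gpt2")
|     for s in ["presidential", "Apoploe vesrreaitais"]:
|         print(s, "->", tok.tokenize(s))
|     # Common words come out as a few frequent subwords; the
|     # gibberish shatters into many rare fragments, each of
|     # which may carry its own statistical associations.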
| normaldist wrote:
| I'm seeing a lot more people experimenting with DALL-E 2.
|
| How does getting access work, do you need a referral?
| mikequinlan wrote:
| https://labs.openai.com/waitlist
| minimaxir wrote:
| There is a waitlist, but OpenAI just announced they are opening
| access more widely from it.
| Cloudef wrote:
| I wonder why they call it "Open"AI
| MatthiasPortzel wrote:
| It's wild to see the discoveries being made in ML research. Like
| most of these 'discoveries,' it makes a fair amount of sense
| after thinking about it. Of course it's not just going to spit
| out random noise for random input, it's been trained to generate
| realistic looking images.
|
| But I think it is an interesting discovery because I don't think
| anyone could have predicted this.
|
| One of my favorite examples is the classification model that will
| identify an apple with a sticker on it that says "pear" as a
| pear -- it makes sense, but it's still surprising the first time
| you see it.
| astrange wrote:
| > One of my favorite examples is the classification model that
| > will identify an apple with a sticker on it that says "pear"
| > as a pear -- it makes sense, but it's still surprising the
| > first time you see it.
|
| That classification model (CLIP) is the first stage of this
| image generator (DALLE) - and actually this shows that it
| doesn't think they're exactly the same thing, or at least
| that's not the full story, because DALL-E doesn't confuse the
| two.
|
| However, other CLIP guided image generation models do like to
| start writing the prompt as text into the image if you push
| them too hard.
| wongarsu wrote:
| Was DALL-E 2 trained on captions from multiple languages? If so,
| this makes a lot of sense. Somewhere early in the model, the
| words "bird", "Vogel", "oiseau" and "pájaro" have to be mapped
| to the
| same concept. And "Apoploe vesrreaitais" happens to map to the
| same concept. Or maybe "Apoploe vesrreaitais" is rather the
| tokenization of that concept, since it also appears in the
| output. So in a sense DALL-E is using an internal language to
| make sense of our world.
| link0ff wrote:
| This looks like how the artificial language Lojban was
| constructed: its words share parts from completely unrelated
| languages, to the point that none of the original words are
| recognizable in the result.
| alxndr wrote:
| The original words aren't recognizable at first glance, but
| they do serve as potential mnemonics for remembering the
| terms/definitions for any learners who speak one of those
| source languages (English, Spanish, Mandarin, Arabic,
| Russian, Hindi)
| melony wrote:
| But that's expected behavior for a language model (especially
| VAEs) -- where's the novelty? In a VAE, the vectors are
| probabilistic in the latent space, so this is basically the NLP
| version of the classic VAE facial image generation, where you
| can tweak the latent parameters to emphasize or de-emphasize a
| feature.
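| For reference, that latent-space trick as a minimal numpy
| sketch (the vectors here are random placeholders standing in
| for real VAE encodings, and decoder() is assumed to exist):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     z_neutral = rng.normal(size=512)   # latent code, neutral face
|     z_smiling = rng.normal(size=512)   # latent code, smiling face
|     smile_dir = z_smiling - z_neutral  # "smile" direction
|
|     for alpha in (0.0, 0.5, 1.0, 1.5):  # 1.5 exaggerates it
|         z = z_neutral + alpha * smile_dir
|         # image = decoder(z)  # run through the VAE decoder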
| tomrod wrote:
| Novel in engineering together of multiple concepts, if
| nothing else!
| la64710 wrote:
| Does Google Translate support this?
| godelski wrote:
| Interestingly Google detects these words as Greek. I know they
| are nonsensical and not actually Greek but I'm wondering if any
| Greek speakers might be able to provide some insights. Are these
| gibberish words close to meaningful words? (clear shot in the
| dark here) Maybe a linguist could find more meaning?
| deckeraa wrote:
| One could conjecture that "Apoploe" is similar to apo pouli,
| "from bird". But I don't have much support for that conjecture.
| PartiallyTyped wrote:
| The word is apoplous, or apoploI
| noizejoy wrote:
| Or maybe it's a subtle joke by Google as a play on the idiom
| "it's all Greek to me"?
| PartiallyTyped wrote:
| As a native Greek speaker: no, they don't make any sense... sort
| of. My hunch is that they read significantly more like Latin
| than Greek. However, it does tell us something about Google
| Translate.
|
| The reason "Apoploe vesrreaitais" is detected as Greek is that
| the first "word" is "phonetically" similar to the word
| apoplous, which means sailing/shipping and is rooted in ancient
| Greek. If we were to write it using Roman characters, we would
| write apoplous, or apoploi in the plural (in Greek, apoploI).
| So I think the model understands that the "oe" suffix is used
| to represent the Greek suffix "oi" used for plurals. The rest
| of the word is rather close phonetically, so there is some
| model that maps phonetic representations to the correct word.
|
| The other phrase seems to be composed of words classified as
| Portuguese, Spanish, Lithuanian, and Luxembourgish.
| stavros wrote:
| I don't think that's how language detection works; they most
| likely use the frequencies of n-grams to detect language
| probability. It's still detected as Greek if you change it to
| "Apoulon vesrreaitais", just because it kind of looks the way
| Greek words look, not because it resembles any specific word.
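| For the curious, the n-gram idea in toy form (this illustrates
| only the mechanism, not Google's actual detector; real systems
| train character-n-gram profiles on large corpora, and the
| sample sentences here are mine):
|
|     from collections import Counter
|
|     def trigrams(text):
|         t = f"  {text.lower()} "
|         return Counter(t[i:i+3] for i in range(len(t) - 2))
|
|     # Tiny per-language profiles from sample text (toy-sized).
|     profiles = {
|         "greek-latinized": trigrams(
|             "kai o anthropos eipe oti to apoploi einai kalo"),
|         "english": trigrams(
|             "and the man said that the sailing away was good"),
|     }
|
|     def detect(text):
|         grams = trigrams(text)
|         def score(lang):
|             return sum(min(n, profiles[lang][g])
|                        for g, n in grams.items())
|         return max(profiles, key=score)
|
|     print(detect("Apoploe vesrreaitais"))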
| PartiallyTyped wrote:
| You are wrong. Had it been that simple, I would __not__ have
| suggested that, and for whatever reason I find your reply
| borderline infuriating, but I can't pinpoint exactly why that
| is.
|
| Regardless, here is me, a native speaker, disproving your
| hypothesis.
|
| I tried the following words in Google Translate: elefantas,
| ailaifantas, ailaiphantas, elaiphandas, elaiphandac.
|
| The suggested detections are elephantas, ailaiphantas,
| ailaiphantas, elaiphantas, elaiphantas; however, the
| translations are elephant, illuminated, illuminated, elephant,
| elephant, respectively. The first is correct. When mapping the
| Roman characters back to Greek there is loss of information;
| this is seen in the diaeresis above iota, which changes the
| pronunciation from e-like [e] to ai [ai], and in the stress
| denoted via the mark above epsilon.
|
| Notice that all the words have an edit distance of >= 4, a
| soundex distance of at most 1, and a metaphone distance of at
| most 1 [1]. The suggested words, as I said above, are near
| homophones of the correct word bar a few minor details.
|
| [1] http://www.ripelacunae.net/projects/levenshtein
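| Those metrics are easy to reproduce; the jellyfish library is
| one that exposes them (a sketch using the words from above):
|
|     import jellyfish  # pip install jellyfish
|
|     target = "elefantas"
|     variants = ["ailaifantas", "ailaiphantas",
|                 "elaiphandas", "elaiphandac"]
|     for v in variants:
|         print(v,
|               jellyfish.levenshtein_distance(target, v),
|               jellyfish.soundex(v),    # coarse phonetic key
|               jellyfish.metaphone(v))  # finer phonetic key
|     # Large edit distances but (per the claim above) nearly
|     # matching phonetic keys: different on paper, same sound.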
| stavros wrote:
| > for whatever reason I find your reply borderline
| infuriating but I can't pinpoint exactly why that is.
|
| I guess that says more about you than about my reply. Also,
| I'm a native speaker as well. That doesn't really have any
| bearing; my comment above comes from what I know about common
| implementations of language-detection algorithms, not so much
| from looking at how Google Translate behaves.
| PartiallyTyped wrote:
| And I was honest about how I felt given how you
| structured it.
|
| It does have a lot of bearing, actually. While I am a native
| speaker, my spelling skills are atrocious, as everything is a
| sequence of sounds in my head more so than a sequence of
| letters. To get around my spelling issues I frequently use
| homophones to find the correct spelling of a word, which relies
| on soundex or similar algorithms to find the correct word along
| with character mappings between the two languages.
|
| Regardless, I believe I have shown the hypothesis to be false.
| godelski wrote:
| This is a great response (I also suspected we'd learn
| something from the Google Translate black box). And I agree
| with the idea of being closer to Latin gibberish. The
| phonetic relationships are a great hint to what's actually
| going on.
|
| My hypothesis here is more that these models are trained
| predominantly on Western languages, and thus their latent
| representation of "language" is going to look like Latin
| gibberish, due to a combination of the evolution of these
| languages and human bias. ("It's all Greek to me")
| PoignardAzur wrote:
| Wait, how does that make any sense?
|
| I thought DALL-E's language model was tokenized, so it doesn't
| understand that, e.g., "car" is made up of the letters 'c', 'a'
| and 'r'.
|
| So how could the generated pictures contain letters that form
| words that are tokenized into DALL-E's internal "language"?
| Shouldn't we expect that feeding those words to the model would
| give the same result as feeding it random invented words?
|
| Actually, now that I think about it, how does DALL-E react when
| given words made of completely random letters?
| seydor wrote:
| Damn. I hope archaeologists can use that to decipher old
| scripts.
| ricardobeat wrote:
| The paper is just as long as the twitter thread.
| smusamashah wrote:
| A few days ago I was wondering what DALL-E would generate if
| given gibberish (I tried to request this, but it wasn't
| entertained). This sounds like an answer to that, to some
| extent.
|
| I think there will be multiple words for the same thing. Also,
| unlike 'bird', the word 'Apoploe vesrreaitais' might actually
| mean a specific kind of bird in a specific setting.
| DonHopkins wrote:
| Has anyone tried talking to it in Simlish?
|
| https://en.wikipedia.org/wiki/Simlish
|
| https://web.archive.org/web/20040722043906/http://thesims.ea...
|
| https://web.archive.org/web/20121102012431/http://bbs.thesim...
| ml_basics wrote:
| I find it really interesting how these new large models (DALL-E,
| GPT-3, PaLM, etc.) are opening up new research areas that don't
| demand the massive resources required to actually train the
| models.
|
| This may act as a counterbalance to the trend of the last few
| years of all major research becoming concentrated in a few tech
| companies.
| YeGoblynQueenne wrote:
| If I understand correctly from the twitter thread (I haven't read
| the linked technical report) the author and a collaborator found
| that DALL-E generated some gibberish in an image that showed two
| men talking, one holding two ... cabbages? They fed (some of) the
| gibberish back to DALL-E and it generated images of birds,
| pecking at things.
|
| Conclusion: the gibberish is the expression for birds eating
| things in DALL-E's secret language.
|
| But wait. Why is the same gibberish in the first image, which
| has the two men and the cabbages(?), but no birds?
|
| Explanation: the two men are clearly talking about birds:
|
| >> We then feed the words: "Apoploe vesrreaitars" and we get
| birds. It seems that the farmers are talking about birds, messing
| with their vegetables!
|
| With apologies to my two compatriots, but that is circular
| thinking enough to make my head spin. I'm reminded of nothing
| so much as the scene in Monty Python and the Holy Grail where
| the wise Sir Bedivere explains why witches are made of wood:
|
| https://youtu.be/zrzMhU_4m-g
| throw457 wrote:
| I bet it's just a form of copy protection.
| ceejayoz wrote:
| Like https://en.wikipedia.org/wiki/Trap_street?
| 867-5309 wrote:
| and Wagatha
| Imnimo wrote:
| I tried a few of these in one of the available CLIP-guided
| diffusion notebooks, but wasn't able to get anything that looks
| like DALL-E meanings. Not sure if DALL-E retrained CLIP (I don't
| think they did?), but it maybe suggests that whatever weirdness
| is going on here is on the decoder side?
|
| All the cool images that DALL-E spits out are fun to look at, but
| this sort of thing is an even more interesting experiment in my
| book. I've been patiently sitting on the waitlist for access, but
| I can't wait to play around with it.
| dpierce9 wrote:
| Gavagai!
| alxndr wrote:
| (explaining the joke:
| https://en.m.wikipedia.org/wiki/Indeterminacy_of_translation )
| ortusdux wrote:
| I wonder if any linguists are training a neural network to
| generate Esperanto 2.0.
| Veedrac wrote:
| Wow, I am totally going to need to wait for more experimentation
| before believing any given thing here, but this seems like a big
| deal.
|
| It's one thing if DALL-E 2 was trying to map words in the prompt
| to their letter sequences and failing because of BPEs; that shows
| an impressive amount of compositionality but it's still image-
| model territory. It's another if DALL-E 2 was trying to map the
| prompt to semantically meaningful content and then failing to
| finish converting that content to language because it's too small
| and diffusion is a poor fit for language generation. That makes
| for worse images but it says terrifying things about how much
| DALL-E 2 has understood the semantic structure of dialog in
| images, and how this is likely to change with scale. Normally I'd
| expect the physical representation to precede semantic
| understanding, not follow it!
|
| That said I reiterate that a degree of skepticism seems warranted
| at this point.
| trebligdivad wrote:
| Is this finally a need for a xenolinguist?
___________________________________________________________________
(page generated 2022-05-31 23:00 UTC)