[HN Gopher] Teaching GPT-3 to reverse words
___________________________________________________________________
Teaching GPT-3 to reverse words
Author : ascertain
Score : 69 points
Date : 2022-05-15 19:42 UTC (1 day ago)
(HTM) web link (twitter.com)
(TXT) w3m dump (twitter.com)
| jameshart wrote:
| Oh, I'm so looking forward to my next coding interview.
|
| "Okay, could you show me on the whiteboard how you might go about
| writing a program that can reverse a string?"
|
| "Great, so I'm going to start by initializing a simple
| transformer-based neural network with 175 billion parameters and
| 96 attention layers, and I'm going to train it on a corpus of 45
| terabytes of data tokenized into about 500 billion tokens..."
| oneepic wrote:
| "Cool, so what do you think would be the time complexity of
| that? Do you think we can maybe do better than that?"
| visarga wrote:
| Not if you also want a short poem where each word starts with
| a letter from the original word, and then a short literary
| commentary on it.
| jameshart wrote:
| Actually it turns out it's O(n). Which goes to show that
| constant factors can be more important than you think when
| looking at raw time complexity big-O.
| [deleted]
| [deleted]
| agluszak wrote:
| https://nitter.net/npew/status/1525900849888866307
| swid wrote:
| It's funny to me that this kind of usage of GPT is just
| programming with a lot of extra steps.
| convolvatron wrote:
| I was just thinking the opposite - that by choosing such a tiny
| problem one might be able to actually develop some intuition
| about what's going on inside that very black box
| swid wrote:
| I meant it mostly as a joke, but there is a certain amount of
| irony to it. This goes way beyond prompt engineering - he
| wrote an algorithm to run on GPT in a way you would not
| expect a non-programmer to write. I think the idea is cool
| and the process to write it was revealing.
| mberning wrote:
| Right. What non-programmer is going to think to turn a word
| into a character list with positional metadata sprinkled in?
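| 
| A rough sketch of that kind of prompt construction in Python (the
| helper name and exact wording are made up here, not taken from the
| tweet):
| 
|     def build_reversal_prompt(word: str) -> str:
|         # Tag every character with its 1-based position, so reversal
|         # becomes a lookup task instead of a string operation.
|         numbered = ", ".join(f"{i + 1}: {c}" for i, c in enumerate(word))
|         return (
|             f'The word "{word}" has {len(word)} letters:\n'
|             f"{numbered}\n"
|             "Now list the letters from the last position back to the first:\n"
|         )
| 
|     print(build_reversal_prompt("alphabet"))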
| visarga wrote:
| I used a similar technique for a completely unrelated
| task. My "original" idea.
| jameshart wrote:
| It's actually weirdly similar to the kind of tricks
| people use for mental feats like memorizing the order of
| a complete deck of cards or repeating back a long list of
| words in reverse order.
|
| When you think about every mental task GPT3 is being
| asked to do as being something it is being asked to
| perform _immediately_ and _without having prepared_ and
| _as fast as possible_ this makes a lot more sense.
|
| Like, a reasonable human response to "quick! What's
| encyclopedia backwards?!" would be more like
|
| "Er.. right. A. I. D. E. O? Oh wait is it one of those OE
| ligature things? P. A. No, O. P. Hang on did I already
| say P?"
| jameshart wrote:
| If you just ask GPT-3 text-davinci-002 to complete
| 
|     Create a Python program to reverse a string:
| 
| it produces
| 
|     def reverse(s):
|         return s[::-1]
| 
| And that isn't even the code-specific model.
| mrfusion wrote:
| > Tokens are chunks of characters. For example, the word
| "alphabet" gets broken up into the tokens "alph" and "abet".
|
| I didn't know that. Seems like it would confuse it during
| training. Anyone able to explain?
| aeternum wrote:
| Humans also think about words in terms of subcomponents;
| languages make heavy use of prefixes and suffixes, for example.
| SemanticStrengh wrote:
| This is not the same. The masks are randomized and lossy.
| That said, a transformer specially trained to segment
| prefixes/affixes/suffixes might augment some of its encoding
| abilities; see e.g. SpanBERT for a related example of that
| kind of opportunity.
| MauranKilom wrote:
| What do you mean with "lossy"? What information is being
| lost? Or do you just mean that there isn't necessarily a
| unique way to encode a given string?
| SemanticStrengh wrote:
| This is masked token learning, which is used e.g. by BERT. It
| is obsolete, and alternatives such as XLNet are much superior,
| but there is too much inertia in the industry and newer large
| models are still built with the same lossy encoding.
| gattilorenz wrote:
| If I recall correctly, it's similar to how fastText vectors
| work. For fastText, this means the representation of a word
| depends to a certain extent on its morphemes (not really, but
| bear with me), so rare or inflected words can get a better
| representation thanks to their similarity to words that look
| alike and are more frequent (e.g. "unconstitutional" might
| never appear in the training data, but the system can
| approximate its meaning by composing that of "un", which it
| has seen in words such as "unbelievable", with the remaining
| subtokens, which come from the word "constitutional" that was
| present in the training set).
| 
| Not sure if the same thing happens here, though.
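| 
| A much-simplified sketch of the fastText idea (real fastText hashes
| character n-grams of several lengths, 3 to 6 by default; this only
| shows why the words above end up sharing subword features):
| 
|     def char_ngrams(word: str, n: int = 3) -> set[str]:
|         # fastText pads words with boundary markers before extracting
|         # character n-grams.
|         padded = f"<{word}>"
|         return {padded[i:i + n] for i in range(len(padded) - n + 1)}
| 
|     rare = char_ngrams("unconstitutional")
|     print(rare & char_ngrams("constitutional"))  # many shared n-grams
|     print(rare & char_ngrams("unbelievable"))    # the "<un" prefix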
| 6gvONxR4sf7o wrote:
| The alternatives are learning at the character level (way more
| complex, and scales badly in memory/compute), or learning at
| the whole word level (needs absurdly massive dictionary of
| words, and still can't handle really rare/novel words).
| Breaking things into a set of subwords that allows you to
| encode any string solves lots of problems and is the relatively
| standard way to do things these days.
| gwern wrote:
| > The alternatives are learning at the character level (way
| more complex
|
| No, BPEs are more complex: you have a whole additional layer
| of preprocessing, with all sorts of strange and
| counterintuitive downstream effects and brand new ways to
| screw up (fun quiz question: everyone knows that BPEs use
| '<|endoftext|>' tokens to denote document breaks; what does
| the string '<|endoftext|>' encode to?). BPEs are reliably one
| of the ways that OA API users screw up, especially when
| trying to work with longer completions or context windows.
|
| But a character is a character.
|
| > and scales badly in memory/compute)
|
| Actually very competitive:
| https://arxiv.org/abs/2105.13626#google (Especially if you
| account for all the time and effort and subtle bugs caused by
| BPEs.)
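| 
| For anyone curious about that quiz question, here is one way to poke
| at it with OpenAI's tiktoken library (which postdates this thread but
| exposes the same GPT-2/GPT-3 BPE); the exact split is left to the
| tokenizer rather than asserted here:
| 
|     import tiktoken
| 
|     enc = tiktoken.get_encoding("gpt2")
| 
|     # Encoded as ordinary text, the literal string is split into
|     # several plain BPE tokens, not the single special token.
|     as_text = enc.encode("<|endoftext|>", disallowed_special=())
|     print([enc.decode([t]) for t in as_text])
| 
|     # Only when explicitly allowed as a special token does it map to
|     # the reserved end-of-text id (50256 in this vocabulary).
|     print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))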
| andrewmutz wrote:
| I believe GPT-3 uses byte pair encoding, which allows it to do
| tokenization in a language-neutral manner:
|
| https://en.wikipedia.org/wiki/Byte_pair_encoding
| axiom92 wrote:
| Yeah it's BPE. OpenAI has a nice tool that allows you to play
| with the tokenizer https://beta.openai.com/tokenizer.
| mrfusion wrote:
| I thought I read it uses word2vec?
| a65cec93b wrote:
| > GPT-3 correctly reverses long words! But to get there, we had
| to teach GPT-3 the algorithm to use to get around its
| limitations.
|
| Has GPT-3 really been "taught" anything here? If you don't
| provide an explicit example as the context of your input, GPT-3
| does not retain the ability to reverse words.
| f38zf5vdt wrote:
| No, it isn't taught anything. GPT-3 text generation is
| effectively a really fancy autocompletion algorithm based on
| the n previous tokens in a rolling window. You can only
| "teach" GPT-3 something within that window, and it doesn't
| "learn" there; it just tries its best to generate content
| based on what is stored in its massive n-dimensional table of
| graph edges for tokens.
|
| That is also why it has such a strong propensity to lose the
| plot once you are outside of that window size and it's
| generating new content based on self-generated content.
| yunyu wrote:
| You can update the "graph edges" with content longer than the
| window by fine tuning:
| https://beta.openai.com/docs/guides/fine-tuning
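| 
| A sketch of what a fine-tuning dataset for this task might look
| like, in the JSONL prompt/completion format that guide describes
| (the separator and stop-token conventions here are illustrative,
| not copied from the docs):
| 
|     import json
| 
|     words = ["alphabet", "encyclopedia", "transformer"]
|     with open("reverse_words.jsonl", "w") as f:
|         for w in words:
|             record = {
|                 "prompt": f"Reverse the word: {w}\n\n###\n\n",
|                 "completion": " " + w[::-1] + " END",
|             }
|             f.write(json.dumps(record) + "\n")
| 
| The guide then has you upload that file and kick off a fine-tune
| from the CLI (something like openai api fine_tunes.create -t
| reverse_words.jsonl -m curie).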
| f38zf5vdt wrote:
| Yes, training the model is where it learns, not in prompts.
| Prompting might be considered meta-learning but it will
| always need a reference point given to it from its training
| data, and beyond the prompt the original model is never
| altered.
| skybrian wrote:
| You're right for GPT-3, but it's an example of
| chain-of-thought reasoning, which seems to be a new area of
| research [1] and might get integrated into newer versions:
|
| [1] https://arxiv.org/abs/2201.11903
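| 
| A chain-of-thought prompt in that paper's style (paraphrased
| here, not quoted) just adds worked-out reasoning to the few-shot
| examples instead of bare answers:
| 
|     Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis
|     balls each. How many tennis balls does he have now?
|     A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls.
|     5 + 6 = 11. The answer is 11.
| 
|     Q: The cafeteria had 23 apples. They used 20 to make lunch
|     and bought 6 more. How many apples do they have?
|     A:
| 
| The model is then expected to continue with the reasoning steps
| before the final answer, rather than guessing the answer directly.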
| tiborsaas wrote:
| I'm not sure how you define teaching, but for me, being shown
| an example and then repeating it successfully with another
| input does count as teaching/learning. I know the model
| doesn't update, though; let's not focus on that now.
|
| If anthropomorphizing bothers you, then we could just use
| "prompting", but I feel teaching is a good enough approximation
| here.
| f38zf5vdt wrote:
| It's repeating based on what the trained model has absorbed
| from situations involving instructions similar to the ones
| given here, which were about reversing strings in general.
|
| If the author messed with temperature and retried their
| failing prompt enough times, or simply reworded it a little
| differently, they might also get the correct answer.
| jxy wrote:
| That's easy to solve. Prepare all K-12 textbooks as prompts,
| train another GPT-N to go from input to those prompts, then
| feed those prompts to the current GPT-3.
| 
| Can we get a GPT-N-3 this way to take the SAT?
| Der_Einzige wrote:
| Part of the problem here is that GPT-3 has such a small
| vocabulary. It's 50K tokens, and many of those are either
| garbage, punctuation, or full words (rather than subwords).
|
| I'd be curious to see what scaling up the size of the vocabulary
| would do to improve these results in a model like GPT-3...
| axiom92 wrote:
| 50k is not the number of unique words that GPT-3 supports;
| you're probably referring to the BPE tokens. The input to
| GPT-3 is not tokenized by splitting on spaces; it is based on
| byte-pair encoding tokens. You can play with it here:
| https://beta.openai.com/tokenizer.
|
| A rare word like _blithe_ is tokenized into two BPE tokens: bl
| and ithe, whereas common words like _the_ get their own token.
| rprenger wrote:
| I don't think a larger vocab would help. All the individual
| letters are in the ~50k token vocab already, but the word
| "alphabet" will still not get tokenized to [a, l, p, h, a, b,
| e, t]. Using a larger vocab like PaLM's 256k vocab would have
| the same issue.
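| 
| A quick way to check these claims programmatically, using
| tiktoken's r50k_base encoding (the GPT-3 vocabulary); the comments
| describe the splits reported in this thread rather than guaranteed
| output:
| 
|     import tiktoken
| 
|     enc = tiktoken.get_encoding("r50k_base")
| 
|     def show(text):
|         ids = enc.encode(text)
|         print(f"{text!r} -> {[enc.decode([i]) for i in ids]}")
| 
|     show("the")              # common word: a single token
|     show("blithe")           # rare word: split into "bl" + "ithe"
|     show("alphabet")         # "alph" + "abet", not eight letters
|     show("a-l-p-h-a-b-e-t")  # separators force (near) letter-level tokens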
___________________________________________________________________
(page generated 2022-05-16 23:00 UTC)