[HN Gopher] (Unsuccessfully) Fine-tuning GPT to play "Connections"
___________________________________________________________________
(Unsuccessfully) Fine-tuning GPT to play "Connections"
Author : danielcorin
Score : 63 points
Date : 2024-01-15 17:02 UTC (5 hours ago)
(HTM) web link (www.danielcorin.com)
(TXT) w3m dump (www.danielcorin.com)
| epiccoleman wrote:
| This is pretty interesting. Intuitively, Connections is the kind
| of thing I would expect GPT to _not_ be good at, because almost
| every day there's something that feels kind of "out of left
| field" in the categories. In my experience LLMs are good at
| regurgitating the "standard" take on a topic, or "best
| practices", but lack the creativity and out-of-the-box thinking
| that makes Connections fun.
|
| On the other hand, it feels like the kind of thing where an LLM
| might be _surprisingly_ good, because it could, in theory, be
| able to see more correlations than a human can. Based on these
| results I guess my intuition seems to hold up.
|
| I wonder if a better/different way to approach this could be
| more "algorithmic" - maybe have the LLM generate a list of
| possible categories for each individual word and then try to
| operate on those associations?
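|
| A minimal sketch of that idea, assuming the v1 OpenAI Python
| SDK (the prompt wording is hypothetical, and in practice the
| free-text category names would need fuzzy matching rather than
| exact string equality):
|
|     from collections import defaultdict
|     from openai import OpenAI  # assumes the v1 OpenAI SDK
|
|     client = OpenAI()
|
|     def categories_for(word):
|         # Hypothetical prompt; expects one category per line.
|         resp = client.chat.completions.create(
|             model="gpt-4",
|             messages=[{"role": "user", "content":
|                 f"List five short categories the word '{word}' "
|                 "could fit, one per line."}])
|         return [c.strip().lower() for c in
|                 resp.choices[0].message.content.splitlines()
|                 if c.strip()]
|
|     def candidate_groups(words):
|         by_category = defaultdict(set)
|         for word in words:
|             for cat in categories_for(word):
|                 by_category[cat].add(word)
|         # Categories claimed by exactly four words are promising.
|         return {c: ws for c, ws in by_category.items()
|                 if len(ws) == 4}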
|
| Cool article!
| matsemann wrote:
| The "whole point" of embeddings is that words have a vector
| that represents how well that word fits into a certain
| categories, so words belonging together is close in that vector
| space. So in that sense it almost feels like this should be
| solvable using something simpler than a full LLM. To "just" get
| the embeddings of the words, and then find the groups of 4 that
| minimizes the total distances within the groups.
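|
| A minimal sketch of that, assuming OpenAI's embeddings
| endpoint via the v1 Python SDK (brute force over the ~2.6
| million ways to split 16 words into four groups of four is
| slow but tractable):
|
|     import numpy as np
|     from itertools import combinations
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def embed(words):
|         resp = client.embeddings.create(
|             model="text-embedding-ada-002", input=words)
|         vecs = np.array([d.embedding for d in resp.data])
|         return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
|
|     def partitions(items):
|         # Yield every split into unordered groups of 4; anchoring
|         # the first remaining item avoids duplicate partitions.
|         if not items:
|             yield []
|             return
|         first, rest = items[0], items[1:]
|         for trio in combinations(rest, 3):
|             remaining = [x for x in rest if x not in trio]
|             for tail in partitions(remaining):
|                 yield [(first,) + trio] + tail
|
|     def solve(words):
|         vecs = embed(words)
|         index = {w: i for i, w in enumerate(words)}
|         def cost(group):  # total pairwise cosine distance
|             return sum(1 - vecs[index[a]] @ vecs[index[b]]
|                        for a, b in combinations(group, 2))
|         return min(partitions(words),
|                    key=lambda p: sum(cost(g) for g in p))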
| basil-rash wrote:
| The problem is that Connections is designed around a ton of
| alternate definitions and other ambiguities that aren't well
| modeled in typical embeddings. Today's, for instance
| (spoilers!!), links Coat, Green, Pod, and Soup because they all
| match "Pea ___". No embedding would relate them at all, unless
| that "Pea ___" pattern is known a priori.
| RheingoldRiver wrote:
| This game looks cool but wow the UX is terrible. Why can't you
| click & drag the words to reorder them? Seems half the difficulty
| is keeping track of your thought process with the inability to
| make a draft state.
| pmelendez wrote:
| >Why can't you click & drag the words to reorder them? That
| level of difficulty is part of the game. There is a shuffle
| button to ease the ideas generation but most likely is was done
| like that by design.
| matsemann wrote:
| Also, there is no reason to limit the number of guesses. Just
| let me try until I figure it out. But no, they've put a limit
| so that it can be "sharable" in a tweet-sized text, to try to
| copy the virality of Wordle. But they do it to the detriment
| of the gameplay, in such a way that I don't even bother
| playing.
| krainboltgreene wrote:
| The whole point of the game is to do it within a bounded set
| of moves.
| matsemann wrote:
| No, it scores you based on the number of moves used. No
| need for an upper bound, could've let me use 20 guesses if
| that's what it takes (non native speaker). But that
| wouldn't fit their copy&paste result formatting..
| n2d4 wrote:
| Disagreed; the limit is what gives the game a constraint and
| makes it interesting, IMHO. I like having something that can
| make me fail, because I care less about optimizing a score and
| more about beating it in the first place. Different people
| play games differently, etc.
|
| I also don't see how it makes it "sharable". Wouldn't it be
| more sharable if they let everyone win and just gave them a
| score?
| aidos wrote:
| The format in the show it's lifted from (Only Connect -
| greatest game show ever) is that the teams have 2 minutes
| total to solve the "connecting wall". They can have as many
| guesses as they want until they solve the first two groups -
| after that it's 3 strikes and you're out.
| charcircuit wrote:
| It's a shame you can't just see the probability distributions
| for the 16 words and choose them yourself; that way, you never
| hallucinate a word and the groups are always 4 words long.
| meatmanek wrote:
| I've played around with the same problem, though I didn't do any
| fine-tuning. Some strategies that seemed promising:
|
| - A two-pass approach where you prompt it to generate the
| groupings, then separately prompt it to find the words that
| belong in each group ("Which of the following words best fit
| the category 'COMMON DOG NAMES'?"). It does way better at the
| more specific queries.
|
| - Don't tell it the constraints of 4 groups of 4 words; ask it
| for at least four groups of 2 or more words. Once you have 4+
| groups of 4+ words, you can make logical inferences with your
| Python wrapper to come up with all the possible 4x4 groupings
| (see the sketch below). If you're lucky there will only be
| one. If not... more queries to GPT, I guess, but I haven't
| figured this part out.
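|
| That inference step needs no model calls at all - a rough
| sketch in plain Python (candidate_groups here stands for
| whatever word-sets GPT proposed):
|
|     from itertools import combinations
|
|     def consistent_groupings(candidate_groups):
|         # Expand each proposed group (possibly >4 words) into
|         # its 4-word subsets, then keep every choice of four
|         # subsets covering all 16 words (which forces them to
|         # be disjoint). Ideally exactly one choice survives.
|         quads = {frozenset(q) for g in candidate_groups
|                  for q in combinations(g, 4)}
|         return [combo for combo in combinations(quads, 4)
|                 if len(frozenset().union(*combo)) == 16]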
| xnorswap wrote:
| This game is well known in the UK as the "Connecting Wall" from
| Only Connect.
|
| This result - poor ChatGPT performance - surprises me. I
| thought pattern detection and set forming were something that
| ChatGPT could do well. Perhaps it would need a model
| specifically trained for this task. If AlphaZero can master
| chess, then surely this game isn't beyond what is trainable.
|
| You can prompt ChatGPT that it'll be playing the connecting
| wall without having to explain the game. It still fails to
| make a good set of connections when provided the wall.
|
| One interesting part of the "Connecting Wall" sets is that
| there is almost always a "wordy" one involving changing a
| letter, adding a prefix, anagrams, etc., and almost always a
| "person" one - for example, there'll be a set of "Famous
| people named Tom..." but not a set of "Toms" alongside a set
| of "Margarets" - and then a couple of general sets.
|
| This is a huge help given the 2 minutes and 30 seconds provided.
|
| On another note, it's possible that the GCHQ puzzle book would
| be in the training set; it has many puzzles with solutions in
| this format, plus a very similar rubric with 55 items and sets
| of sizes 1 through 10. That said, ChatGPT perhaps would not
| tie the answers in the back of the book to the puzzles in the
| front.
|
| In all, I think an AI trained for this purpose on problems and
| their solutions ought to end up mastering this format. But
| general-purpose ChatGPT seems to perform very badly.
| archgoon wrote:
| Chess has a well defined set of correct solutions. The rules
| are well known and understood.
|
| Connections is much less so.
| monsieurbanana wrote:
| Neophyte question:
|
| Can we infer anything about what LLMs can achieve from what we
| can achieve with AIs like AlphaGo? I thought their approaches
| were completely separate.
| themoonisachees wrote:
| Not really.
|
| GPTs are a class of text predictors. Ultimately they are
| scored on whether or not the output is similar to the training
| data, text-wise. If the training data included a game, then
| they may be able to play that game, but only if that game
| requires reasoning about entire words (because of
| tokenization, GPTs can't reason in terms of letters; that's
| why they do poorly at crosswords, for example).
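|
| For instance, a quick way to see that word-level granularity
| (a minimal sketch using OpenAI's tiktoken library):
|
|     import tiktoken
|
|     enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4
|     tokens = enc.encode("Connections")
|     # Prints multi-character chunks, not individual letters.
|     print([enc.decode([t]) for t in tokens])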
|
| On the flip side, AlphaZero is a class of networks that have
| a list of actions they can take, and a list of parameters
| they observe about the game (in chess: the board position, in
| other games: their position on screen, score, speed, etc).
| The model is then trained to take actions that maximize an
| actual hard value from the game, like winning a game of
| chess, capturing a piece, increasing a score, driving the
| furthest.
|
| In theory you could train a model with the AlphaGo method to
| do text prediction, but LLMs are called "large" for a reason:
| the input and output spaces would have to be the number of
| possible tokens (and at that point, just train a normal GPT;
| it's much more efficient). Also in theory you could train a
| GPT to play games, but you'd be spending huge amounts of
| compute evaluating extraneous words in the input (the prompt)
| and the output (most words do not have anything to do with
| your game). On top of that, you're iterating over every word
| you generate to generate the next one, so you're doing
| multiple passes of this largely inefficient computation, which
| makes you slower than a tailor-made model that can evaluate
| one situation once and give you a list of actions to perform.
|
| In this specific case it's a bit weird, because the input
| space for the AlphaZero-style model would have to be every
| word that could appear on the board, but the reasoning part is
| most likely not a problem given enough model size. Since it's
| competing with a multi-gigabyte LLM, though, there is space to
| spare.
| jw1224 wrote:
| > This result - poor Chat GPT performance - surprises me. I
| thought pattern detection and set forming was something that
| Chat GPT could do well
|
| I would speculate it's struggling because of the linear nature
| of its output, and the red-herring words which cross over
| between categories.
|
| Because the model can't "look ahead", it starts spitting out
| valid combinations, but without being able to anticipate that
| committing to a certain combination early on will lead to a
| mistake later.
|
| I expect if you asked it to correct its output in a followup
| message, it could do so without much difficulty.
| mtlmtlmtlmtl wrote:
| Not sure how AlphaZero is relevant to whether a transformer
| can play Connections. AlphaZero is not a transformer, and
| chess is not Connections.
| firebaze wrote:
| ChatGPT-4 solved today's riddle on the first try for me.
| Caution, spoilers ahead:
| https://chat.openai.com/share/0c40a0b5-ab8f-4094-a7cc-21bb94...
|
| (it even ignored some embarrassing typos ...)
| mbb70 wrote:
| Doesn't this list the words in the order that they are
| grouped? The article states that randomizing the words
| completely eliminates any successful results
| jph00 wrote:
| It didn't solve it -- instead it simply created groups in the
| exact order you provided.
| benpacker wrote:
| I unfortunately can't imagine having time to test this, but
| there may be a way to accomplish it with embeddings.
|
| The game itself is sort of an embeddings clustering problem, with
| the added difficulty that each group needs to be alike in only
| one way (versus a full vector distance, which measures how
| alike they are in _every_ way).
|
| Maybe there is some way to search for a vector of weights
| which, when multiplied by all members of a group of 4,
| produces weighted vectors with the least distance from their
| center? Then it's an optimization problem to find the 4 groups
| that minimize the total distance from each group's center.
|
| It may be possible to find a weight vector that selects for a
| particular slice of a word's meaning.
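|
| A sketch of what that weight search might look like (numpy +
| scipy; whether the learned weights actually isolate one
| "slice" of meaning is an open question, and for real embedding
| sizes you'd want to regularize):
|
|     import numpy as np
|     from scipy.optimize import minimize
|
|     def weighted_spread(vecs, w):
|         # vecs: (4, d) embeddings of one candidate group;
|         # w: (d,) per-dimension weights.
|         scaled = vecs * w
|         return np.sum((scaled - scaled.mean(axis=0)) ** 2)
|
|     def best_weights(vecs):
|         # Find weights making the four vectors maximally tight;
|         # normalizing w inside the objective rules out the
|         # trivial all-zero solution.
|         d = vecs.shape[1]
|         obj = lambda w: weighted_spread(vecs, w / np.linalg.norm(w))
|         return minimize(obj, x0=np.ones(d)).x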
| noxvilleza wrote:
| That approach works well for a game like Codenames
| (https://en.wikipedia.org/wiki/Codenames_(board_game)), where
| you're trying to find a single-word common hint between many
| of your words (one that doesn't hit any of the other words).
|
| My feeling is that it'll struggle with wordplay in Only
| Connect/Connections (like missing letters, added letters,
| words-within-words, homophones, etc.) as well as two-step
| references (such as {Venice, Dream, Night, Nothing} => "last
| words of Shakespeare plays").
| jappgar wrote:
| I've found that for tasks like this you need to ask the model to
| output its reasoning prior to outputting the "answer".
|
| See "chain of thought" e.g
|
| https://arxiv.org/abs/2201.11903
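|
| A minimal example of what that prompt structure might look
| like, assuming `words` holds the 16 puzzle words (the wording
| here is mine, not from the paper):
|
|     prompt = (
|         "Here are 16 words from today's Connections puzzle: "
|         f"{', '.join(words)}\n"
|         "First, reason step by step: list plausible categories, "
|         "note which words could fit more than one, and resolve "
|         "the conflicts. Only after that reasoning, give your "
|         "final answer as four lines of four words each."
|     )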
| orzig wrote:
| Love a published null result!
| amayne wrote:
| You'll _probably_ get better results by putting the examples only
| in the completion part of the training examples.
|
| GPT-3.5 learns how to generalize better when it's just in the
| completion.
|
| This is the same problem that vexed the researchers who did the
| paper on the alleged reversal curse.
|
| (https://andrewmayne.com/2023/11/14/is-the-reversal-curse-rea...)
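|
| Concretely, one hedged reading of that advice in OpenAI's chat
| fine-tuning JSONL format: keep the prompt side minimal so the
| puzzle and its solution both live in the completion.
|
|     import json
|
|     # Hypothetical training record; the placeholders stand for
|     # a real puzzle and its real solution.
|     record = {"messages": [
|         {"role": "user", "content": "Play Connections."},
|         {"role": "assistant",
|          "content": "Words: <the 16 words>\n"
|                     "Groups: <the four groups>"},
|     ]}
|     print(json.dumps(record))  # one line of the training JSONL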
| daemonologist wrote:
| I suspect the inability of the model to "plan ahead" is a
| significant contributor to its poor performance relative to a
| human. Being able to check a grouping to be sure it includes at
| least four words _and_ to check that it doesn't conflict with the
| other three groupings is a major advantage - it's pretty common
| that these puzzles include partial or incompatible red herring
| groups.
|
| If this is the case, performance might be improved by taking the
| final solving responsibility away from the model and giving it to
| the script. You could ask GPT for categories, ask whether each
| word fits each category (discarding categories with fewer than 4
| words), and then search for 4 non-overlapping categories.
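|
| A rough sketch of that per-category check, assuming the v1
| OpenAI Python SDK (the prompt and model choice are
| hypothetical; surviving categories could then feed a
| disjoint-4x4 search like the one sketched upthread):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def words_in_category(category, words):
|         # Ask which of the 16 words fit one proposed category.
|         resp = client.chat.completions.create(
|             model="gpt-3.5-turbo",
|             messages=[{"role": "user", "content":
|                 f"Which of these words fit the category "
|                 f"'{category}'? Reply with only the matching "
|                 "words, comma-separated: " + ", ".join(words)}])
|         answer = resp.choices[0].message.content
|         picked = {w.strip().lower() for w in answer.split(",")}
|         return [w for w in words if w.lower() in picked]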
|
| (This might be missing the point of the exercise though.)
| selimthegrim wrote:
| For a second I thought James bloody Burke was going to enter the
| chat
| esafak wrote:
| What would be bad about that? His Connections documentary was
| great. And apparently rebooted last year:
| https://curiositystream.com/video/6468
| binarymax wrote:
| You need to model how a person actually plays Connections.
| Start with the most obvious group, the one with the least
| ambiguity; then your problem space is smaller for category 2,
| and the same again for categories 3 and 4.
|
| So really you could fine tune 3 models - one for 16 words, one
| for 12, and one for 8. Then use them in succession.
|
| Also, if you come across a mistake at the end (have some negative
| examples in the training sets), tell it to start over and add to
| the prompt what you think is NOT a group.
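|
| In outline, the succession might look like this (pick_group is
| a hypothetical stand-in for querying one fine-tuned model and
| parsing out its most confident group):
|
|     def solve(words, model_ids):
|         # model_ids: hypothetical fine-tuned models for the
|         # 16-, 12-, and 8-word stages.
|         groups = []
|         for model_id in model_ids:
|             group = pick_group(model_id, words)
|             groups.append(group)
|             words = [w for w in words if w not in group]
|         groups.append(words)  # the last four fall out for free
|         return groups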
| thomasahle wrote:
| It might even be easier to pick an arbitrary word and ask it
| to find the three that match it.
|
| Asking GPT to just pick any group adds a lot of extra "mental
| overhead".
|
| Though of course this works best if all the groups are roughly
| of the same difficulty.
| s1mon wrote:
| I spent a bunch of time manually just using GPT-4 with fairly
| simple prompts and giving it the same feedback that the game
| gives. There's an archive of puzzles which I used to try to train
| it with, and sometimes it would be very successful, and sometimes
| it was frustrating how bad it was at doing basic things like
| keeping track of what words it had used so far. Each day I would
| also have it play the new puzzle from the NYTimes which it
| couldn't have trained on. Some days it did perfectly; some
| days it made really stupid mistakes. It seems like a more
| concerted effort
| could achieve better results.
___________________________________________________________________
(page generated 2024-01-15 23:00 UTC)