[HN Gopher] (Unsuccessfully) Fine-tuning GPT to play "Connections"
       ___________________________________________________________________
        
       (Unsuccessfully) Fine-tuning GPT to play "Connections"
        
       Author : danielcorin
       Score  : 63 points
       Date   : 2024-01-15 17:02 UTC (5 hours ago)
        
 (HTM) web link (www.danielcorin.com)
 (TXT) w3m dump (www.danielcorin.com)
        
       | epiccoleman wrote:
       | This is pretty interesting. Intuitively, Connections is the kind
       | of thing I would expect GPT to _not_ be good at, because almost
        | every day there's something that feels kind of "out of left
       | field" in the categories. In my experience LLMs are good at
       | regurgitating the "standard" take on a topic, or "best
       | practices", but lack the creativity and out-of-the-box thinking
       | that makes Connections fun.
       | 
       | On the other hand, it feels like the kind of thing where an LLM
       | might be _surprisingly_ good, because it could, in theory, be
       | able to see more correlations than a human can. Based on these
       | results I guess my intuition seems to hold up.
       | 
        | I wonder if a better/different way to approach this could be more
        | "algorithmic" - maybe have the LLM generate a list of possible
        | categories for each individual word and then operate on those
        | associations?
       | 
       | Cool article!
        
         | matsemann wrote:
         | The "whole point" of embeddings is that words have a vector
         | that represents how well that word fits into a certain
         | categories, so words belonging together is close in that vector
         | space. So in that sense it almost feels like this should be
         | solvable using something simpler than a full LLM. To "just" get
         | the embeddings of the words, and then find the groups of 4 that
         | minimizes the total distances within the groups.
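          | 
          | For what it's worth, a rough sketch of that idea (assuming the
          | OpenAI Python client and text-embedding-ada-002; the brute-force
          | search over the ~2.6M possible 4x4 partitions is a naive
          | illustration, not anything from the article):
          | 
          |   import itertools
          |   import numpy as np
          |   from openai import OpenAI  # assumes OPENAI_API_KEY is set
          |   
          |   client = OpenAI()
          |   
          |   def embed(words):
          |       resp = client.embeddings.create(
          |           model="text-embedding-ada-002", input=words)
          |       return np.array([d.embedding for d in resp.data])
          |   
          |   def solve(words):
          |       vecs = embed(words)
          |       vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
          |       dist = 1.0 - vecs @ vecs.T  # cosine distances
          |   
          |       def cost(group):  # sum of the 6 pairwise distances
          |           return sum(dist[a][b] for a, b
          |                      in itertools.combinations(group, 2))
          |   
          |       best = [float("inf"), None]
          |   
          |       def search(remaining, groups, total):
          |           if total >= best[0]:
          |               return  # prune: already worse than best found
          |           if not remaining:
          |               best[0], best[1] = total, groups
          |               return
          |           first, rest = remaining[0], remaining[1:]
          |           for others in itertools.combinations(rest, 3):
          |               g = (first,) + others
          |               left = [i for i in rest if i not in others]
          |               search(left, groups + [g], total + cost(g))
          |   
          |       search(list(range(len(words))), [], 0.0)
          |       return [[words[i] for i in g] for g in best[1]]
          | 
          | It's slow in pure Python but fine for a single board - though, as
          | the reply below points out, plain embeddings won't catch the
          | wordplay groups.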
        
           | basil-rash wrote:
            | The problem is that Connections is designed around a ton of
            | alternate definitions and other ambiguities that aren't well
            | modeled in typical embeddings. Today's, for instance
            | (spoilers!!), has Coat, Green, Pod, and Soup linked because
            | they all complete "Pea ___". No embedding would relate them at
            | all, unless that suffix is known a priori.
        
       | RheingoldRiver wrote:
        | This game looks cool, but wow, the UX is terrible. Why can't you
        | click & drag the words to reorder them? It seems like half the
        | difficulty is keeping track of your thought process, since there's
        | no way to make a draft state.
        
         | pmelendez wrote:
          | > Why can't you click & drag the words to reorder them?
          | 
          | That level of difficulty is part of the game. There is a shuffle
          | button to ease idea generation, but most likely it was done that
          | way by design.
        
         | matsemann wrote:
          | Also, there is no reason to limit the number of guesses. Just
          | let me try until I figure it out. But no, they've put a limit on
          | it so that the result can be "sharable" in a tweet-sized text,
          | to try to copy the virality of Wordle. But they do it to the
          | detriment of the gameplay, to the point that I don't even bother
          | playing.
        
           | krainboltgreene wrote:
           | The whole point of the game is to do it within a bounded set
           | of moves.
        
             | matsemann wrote:
              | No, it scores you based on the number of moves used. There's
              | no need for an upper bound; it could've let me use 20
              | guesses if that's what it takes (non-native speaker). But
              | that wouldn't fit their copy&paste result formatting...
        
           | n2d4 wrote:
            | Disagreed: the limit is what gives the game a constraint and
            | makes it interesting, IMHO. I like having something that can
            | make me fail, because I care less about optimizing a score and
            | more about beating the game in the first place. Different
            | people play games differently, etc.
           | 
            | I also don't see how it makes it "sharable". Wouldn't it be
            | more sharable if they let everyone win and just gave them a
            | score?
        
           | aidos wrote:
           | The format in the show it's lifted from (Only Connect -
           | greatest game show ever) is that the teams have 2 minutes
           | total to solve the "connecting wall". They can have as many
           | guesses as they want until they solve the first two groups -
           | after that it's 3 strikes and you're out.
        
       | charcircuit wrote:
        | It's a shame you can't just see the probability distributions over
        | the 16 words and choose them yourself; that way it never
        | hallucinates a word and the groups are always 4 words long.
        
       | meatmanek wrote:
        | I've played around with the same problem, though I didn't do any
        | fine-tuning. Some strategies that seemed promising:
        | 
        | - A two-pass approach where you prompt it to generate the
        | groupings, then separately prompt it to find the words that belong
        | in each group ("Which of the following words best fit the category
        | 'COMMON DOG NAMES'?"). It does way better at the more specific
        | queries. (Rough sketch below.)
        | 
        | - Don't tell it the constraints of 4 groups of 4 words; ask it for
        | at least four groups of 2 or more words. Once you have 4+ groups
        | of 4+ words, you can make logical inferences with your Python
        | wrapper to come up with all the possible 4x4 groupings. If you're
        | lucky there will only be one. If not... more queries to GPT, I
        | guess, but I haven't figured this part out.
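        | 
        | A rough sketch of what that two-pass flow could look like (the
        | prompts, model name, and parsing here are illustrative guesses,
        | not what the article used):
        | 
        |   from openai import OpenAI  # assumes OPENAI_API_KEY is set
        |   
        |   client = OpenAI()
        |   
        |   def ask(prompt):
        |       resp = client.chat.completions.create(
        |           model="gpt-4",
        |           messages=[{"role": "user", "content": prompt}],
        |           temperature=0)
        |       return resp.choices[0].message.content
        |   
        |   def two_pass(words):
        |       # Pass 1: only ask for candidate category names.
        |       cats = ask(
        |           "These 16 words hide several categories: "
        |           + ", ".join(words)
        |           + "\nList at least four plausible category "
        |             "names, one per line.").splitlines()
        |   
        |       # Pass 2: a narrow question per category is more
        |       # reliable than asking for the whole solution.
        |       groups = {}
        |       for cat in cats:
        |           ans = ask(
        |               "Which of the following words best "
        |               f'fit the category "{cat}"?\n'
        |               + ", ".join(words)
        |               + "\nAnswer with the words only.")
        |           members = [w for w in words
        |                      if w.lower() in ans.lower()]
        |           if len(members) >= 2:
        |               groups[cat] = members
        |       return groups
        | 
        | A wrapper script can then try to assemble four disjoint groups of
        | four from those candidates.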
        
       | xnorswap wrote:
       | This game is well known in the UK as the "Connecting Wall" from
       | Only Connect.
       | 
        | This result - poor ChatGPT performance - surprises me. I thought
        | pattern detection and set forming were things ChatGPT could do
        | well. Perhaps it would need a model specifically trained for this
        | task. If AlphaZero can master chess, then surely this game isn't
        | beyond what is trainable.
       | 
        | You can prompt ChatGPT that it'll be playing the connecting wall
        | without having to explain the game. It still fails to make a good
        | set of connections when provided the wall.
       | 
        | One interesting part of the "Connecting Wall" sets is that there
        | is almost always a "wordy" one involving changing a letter, adding
        | a prefix, anagrams, etc. There is also almost always a "person"
        | one: for example, there'll be a set of "famous people named
        | Tom..." but not a set of Toms alongside a set of Margarets. Then
        | there are a couple of general sets.
       | 
       | This is a huge help given the 2 minutes and 30 seconds provided.
       | 
        | On another note, it's possible that the GCHQ puzzle book is in the
        | training set; it has many puzzles with solutions in this format
        | and a very similar rubric, with 55 items and sets of sizes 1
        | through 10. That said, ChatGPT perhaps would not tie the answers
        | in the back of the book to the puzzles in the front.
       | 
        | All in all, I think an AI trained for this purpose, with problems
        | and given solutions, ought to end up mastering this format. But
        | general-purpose ChatGPT seems to perform very badly.
        
         | archgoon wrote:
          | Chess has a well-defined set of correct solutions. The rules are
          | well known and understood.
         | 
         | Connections is much less so.
        
         | monsieurbanana wrote:
         | Neophyte question:
         | 
          | Can we infer anything about what LLMs can achieve from what we
          | can achieve with AIs like AlphaGo? I thought their approaches
          | were completely separate.
        
           | themoonisachees wrote:
            | Not really.
            | 
            | GPTs are a class of text predictors. Ultimately they are
            | ranked on whether or not the output is similar to the training
            | data, text-wise. If the training data included a game, then
            | the model may be able to play that game, but only if the game
            | requires reasoning about entire words (because of
            | tokenization, GPTs can't reason in terms of letters; that's
            | why they do poorly at crosswords, for example).
            | 
            | On the flip side, AlphaZero is a class of networks that have a
            | list of actions they can take and a list of parameters they
            | observe about the game (in chess: the board position; in other
            | games: their position on screen, score, speed, etc.). The
            | model is then trained to take actions that maximize an actual
            | hard value from the game, like winning a game of chess,
            | capturing a piece, increasing a score, or driving the
            | furthest.
            | 
            | In theory you could train a model with the AlphaGo method to
            | do text prediction, but LLMs are called "large" for a reason:
            | the input and output spaces would have to be the number of
            | possible tokens (and at that point, just train a normal GPT;
            | it's much more efficient). Also in theory you could train a
            | GPT to play games, but you'd be spending huge amounts of
            | compute evaluating extraneous words in the input (the prompt)
            | and the output (most words have nothing to do with your game).
            | On top of that, you're iterating over every word you generate
            | to generate the next one, so you're doing multiple passes of
            | this largely inefficient computation, which means you're
            | slower than a tailor-made model that can evaluate a situation
            | once and give you a list of outputs to perform.
            | 
            | In this specific case it's a bit weird, because the input
            | space for the AlphaZero model would have to be every word that
            | can appear on the board, but the reasoning part is most likely
            | not a problem given enough model size. Since it's competing
            | with a multi-gigabyte LLM, though, there is space to spare.
        
         | jw1224 wrote:
         | > This result - poor Chat GPT performance - surprises me. I
         | thought pattern detection and set forming was something that
         | Chat GPT could do well
         | 
          | I would speculate it's struggling because of the linear nature
          | of its output and the red-herring words which cross over between
          | categories.
         | 
          | Because the model can't "look ahead", it starts spitting out
          | valid combinations without being able to anticipate that
          | committing to a certain combination early on will lead to a
          | mistake later.
         | 
         | I expect if you asked it to correct its output in a followup
         | message, it could do so without much difficulty.
        
         | mtlmtlmtlmtl wrote:
          | Not sure how AlphaZero is relevant to whether a transformer can
          | play Connections. AlphaZero is not a transformer, and chess is
          | not Connections.
        
         | firebaze wrote:
          | ChatGPT-4 solved today's riddle on the first try for me.
         | Caution, spoilers ahead:
         | https://chat.openai.com/share/0c40a0b5-ab8f-4094-a7cc-21bb94...
         | 
         | (it even ignored some embarrassing typos ...)
        
           | mbb70 wrote:
            | Doesn't this list the words in the order in which they are
            | grouped? The article states that randomizing the order of the
            | words completely eliminates any successful results.
        
           | jph00 wrote:
           | It didn't solve it -- instead it simply created groups in the
           | exact order you provided.
        
       | benpacker wrote:
       | I unfortunately can't imagine having time to test this, but I
       | imagine there may be a way to accomplish this with embeddings.
       | 
        | The game itself is sort of an embeddings clustering problem, with
        | the added difficulty that each group needs to be alike in only one
        | way (versus a full vector distance, which measures how alike the
        | words are in _every_ way).
        | 
        | Maybe there is some way to search for a vector of weights which,
        | when multiplied by all members of a group of 4, produces weighted
        | vectors with the least distance from their center? And then it's
        | an optimization problem to find the 4 groups that minimize the
        | total distance from each group's center.
        | 
        | It may be possible to find a weight vector that selects for a
        | particular slice of a word's meaning.
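        | 
        | A rough sketch of one way such a "slice of meaning" score could
        | work (the weighting scheme here is an illustrative guess, not
        | anything from the article): weight each embedding dimension by how
        | much the four candidate words already agree on it, then measure
        | their spread under those weights.
        | 
        |   import numpy as np
        |   
        |   def weighted_tightness(vecs, alpha=1.0):
        |       # vecs: (4, d) array of embeddings for one candidate
        |       # group. Dimensions where the four words already agree
        |       # get more weight, so the score reflects the best-
        |       # matching slice of meaning rather than the full
        |       # vector distance.
        |       center = vecs.mean(axis=0)
        |       var = ((vecs - center) ** 2).mean(axis=0)
        |       weights = np.exp(-alpha * var / (var.mean() + 1e-9))
        |       weights /= weights.sum()
        |       return float((weights * var).sum())  # lower = tighter
        | 
        | You could then use this as the per-group cost inside a search over
        | candidate 4x4 partitions.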
        
         | noxvilleza wrote:
          | That approach works well for a game like Codenames
          | (https://en.wikipedia.org/wiki/Codenames_(board_game)), where
          | you're trying to find a single-word common hint between many of
          | your words (one that doesn't hit any of the other words).
          | 
          | My feeling is that it'll struggle with the wordplay in Only
          | Connect/Connections (like missing letters, added letters,
          | words-within-words, homophones, etc.) as well as two-step
          | references (such as {Venice, Dream, Night, Nothing} => "last
          | words of Shakespeare plays").
        
       | jappgar wrote:
       | I've found that for tasks like this you need to ask the model to
       | output its reasoning prior to outputting the "answer".
       | 
       | See "chain of thought" e.g
       | 
       | https://arxiv.org/abs/2201.11903
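        | 
        | For this puzzle, a chain-of-thought-style prompt might look
        | something like the following (wording is only illustrative, and
        | 'words' is assumed to be the list of 16 puzzle words):
        | 
        |   prompt = (
        |       "Here are 16 words: " + ", ".join(words) + "\n"
        |       "Reason step by step before answering:\n"
        |       "1. For each word, list the senses it could fit.\n"
        |       "2. Propose candidate groups; note conflicts.\n"
        |       "3. Only then give the four groups of four."
        |   )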
        
       | orzig wrote:
       | Love a published null result!
        
       | amayne wrote:
       | You'll _probably_ get better results by putting the examples only
       | in the completion part of the training examples.
       | 
       | GPT-3.5 learns how to generalize better when it's just in the
       | completion.
       | 
       | This is the same problem that vexed the researchers who did the
       | paper on the alleged reversal curse.
       | 
       | (https://andrewmayne.com/2023/11/14/is-the-reversal-curse-rea...)
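        | 
        | One way to read that, in the chat fine-tuning format, is training
        | records where the puzzle and its solution both live in the
        | assistant message and the user prompt stays generic. The field
        | names below are the standard fine-tuning schema; the exact layout
        | is only a guess at the idea:
        | 
        |   import json
        |   
        |   def record(puzzle_words, solution_text):
        |       # Puzzle AND solution go in the completion; the
        |       # prompt stays generic so the model has to learn
        |       # the task rather than memorize prompts.
        |       return {"messages": [
        |           {"role": "user",
        |            "content": "Play a round of Connections."},
        |           {"role": "assistant",
        |            "content": "Words: " + ", ".join(puzzle_words)
        |                       + "\nGroups:\n" + solution_text},
        |       ]}
        |   
        |   with open("train.jsonl", "w") as f:
        |       # 'examples' is your own archive of (words, solution)
        |       for words, solution in examples:
        |           f.write(json.dumps(record(words, solution)) + "\n")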
        
       | daemonologist wrote:
        | I suspect the model's inability to "plan ahead" is a significant
        | contributor to its poor performance relative to a human. Being
        | able to check that a grouping includes at least four words _and_
        | that it doesn't conflict with the other three groupings is a major
        | advantage - it's pretty common for these puzzles to include
        | partial or incompatible red-herring groups.
       | 
       | If this is the case, performance might be improved by taking the
       | final solving responsibility away from the model and giving it to
       | the script. You could ask GPT for categories, ask whether each
       | word fits each category (discarding categories with fewer than 4
       | words), and then search for 4 non-overlapping categories.
       | 
       | (This might be missing the point of the exercise though.)
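        | 
        | A rough sketch of handing that final assembly to the script
        | (assuming 'groups' maps GPT-proposed category names to candidate
        | member words; the search below, not the model, enforces the 4x4
        | constraint):
        | 
        |   import itertools
        |   
        |   def assemble(groups, words):
        |       # Expand each over-full category into all of its
        |       # 4-word subsets; categories with fewer than 4
        |       # candidate words are effectively discarded.
        |       cands = []
        |       for name, members in groups.items():
        |           pool = sorted(set(members) & set(words))
        |           for combo in itertools.combinations(pool, 4):
        |               cands.append((name, frozenset(combo)))
        |   
        |       def backtrack(chosen, used):
        |           if len(chosen) == 4:
        |               if used == set(words):
        |                   return chosen
        |               return None
        |           for name, combo in cands:
        |               if combo & used:
        |                   continue
        |               found = backtrack(chosen + [(name, combo)],
        |                                 used | combo)
        |               if found:
        |                   return found
        |           return None
        |   
        |       # None if no non-overlapping 4x4 cover exists
        |       return backtrack([], set())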
        
       | selimthegrim wrote:
        | For a second I thought James bloody Burke was going to enter the
        | chat.
        
         | esafak wrote:
         | What would be bad about that? His Connections documentary was
         | great. And apparently rebooted last year:
         | https://curiositystream.com/video/6468
        
       | binarymax wrote:
        | You need to model how a person actually plays Connections: start
        | with the most obvious group, the one with the least ambiguity, and
        | then your problem space is smaller for category 2, and the same
        | again for categories 3 and 4.
        | 
        | So really you could fine-tune 3 models - one for 16 words, one for
        | 12, and one for 8 - then use them in succession.
        | 
        | Also, if you come across a mistake at the end, tell it to start
        | over and add what you think is NOT a group to the prompt (and have
        | some negative examples in the training sets).
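        | 
        | A rough sketch of that loop (the model ids are placeholders for
        | hypothetical fine-tuned models, one per board size, and the prompt
        | wording is only illustrative):
        | 
        |   from openai import OpenAI  # assumes OPENAI_API_KEY is set
        |   
        |   client = OpenAI()
        |   
        |   # hypothetical fine-tuned model ids, one per board size
        |   MODELS = {16: "ft:gpt-3.5-turbo:me::conn16",
        |             12: "ft:gpt-3.5-turbo:me::conn12",
        |             8:  "ft:gpt-3.5-turbo:me::conn8"}
        |   
        |   def guess_group(remaining, not_groups):
        |       prompt = "Words: " + ", ".join(remaining) + "\n"
        |       for g in not_groups:
        |           prompt += "NOT a group: " + ", ".join(g) + "\n"
        |       prompt += "Name the single most obvious group of 4."
        |       resp = client.chat.completions.create(
        |           model=MODELS[len(remaining)],
        |           messages=[{"role": "user", "content": prompt}],
        |           temperature=0)
        |       text = resp.choices[0].message.content
        |       return [w for w in remaining
        |               if w.lower() in text.lower()][:4]
        |   
        |   def play(words, check):  # 'check' is the game's feedback
        |       remaining, not_groups, solved = list(words), [], []
        |       while len(remaining) > 4:
        |           guess = guess_group(remaining, not_groups)
        |           if check(guess):
        |               solved.append(guess)
        |               remaining = [w for w in remaining
        |                            if w not in guess]
        |               not_groups = []  # fresh start, smaller board
        |           else:
        |               not_groups.append(guess)
        |       solved.append(remaining)  # last four are forced
        |       return solved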
        
         | thomasahle wrote:
          | It might even be easier to pick an arbitrary word and ask it to
          | find the three that match it.
          | 
          | Asking GPT to just pick any group adds a lot of extra "mental
          | overhead".
         | 
         | Though of course this works best if all the groups are roughly
         | of the same difficulty.
        
       | s1mon wrote:
        | I spent a bunch of time manually just using GPT-4 with fairly
        | simple prompts and giving it the same feedback that the game
        | gives. There's an archive of puzzles which I used to try to train
        | it with; sometimes it was very successful, and sometimes it was
        | frustrating how bad it was at basic things like keeping track of
        | which words it had already used. Each day I would also have it
        | play the new puzzle from the NYTimes, which it couldn't have
        | trained on. Some days it did perfectly; some days it made really
        | stupid mistakes. It seems like a more concerted effort could
        | achieve better results.
        
       ___________________________________________________________________
       (page generated 2024-01-15 23:00 UTC)