[HN Gopher] OpenAI's o1 Playing Codenames
       ___________________________________________________________________
        
       OpenAI's o1 Playing Codenames
        
       Author : suveen_ellawela
       Score  : 195 points
       Date   : 2025-01-22 06:21 UTC (3 days ago)
        
 (HTM) web link (suveenellawela.com)
 (TXT) w3m dump (suveenellawela.com)
        
       | JaggerFoo wrote:
        | I did this with Claude over the holidays, putting Claude in
        | the role of guesser and comparing its guesses to those of an
        | experienced human player. It turns out they matched.
        
         | suveen_ellawela wrote:
          | That's a nice experiment! I think Codenames could
          | definitely be an evaluation method for LLMs.
        
           | pieix wrote:
           | Elo on different card games/board games would be a great eval
           | metric now that the systems are general enough to play
           | Codenames, chess, poker...
        
           | __MatrixMan__ wrote:
           | It would be fun to build one, perhaps mediated by an app,
           | where you have to guess whether your spymaster is a human or
           | an AI based on the quality of their choices.
        
             | zeroonetwothree wrote:
             | The average human is quite bad. It really works well when
             | the spymaster is (a) experienced and (b) familiar with the
             | other players.
        
               | __MatrixMan__ wrote:
               | It's the (b) case I'm interested in. Like the spymaster
               | loses if they can't subtly indicate to their friends that
               | they're the real deal. Otherwise the robots win.
        
       | joaomacp wrote:
       | I tried whatever the multi-modal paid ChatGPT model is on the
       | Codenames Pictures version, and it didn't fare that well. Since
        | they will probably scrape this comment and add it to the
        | next model's training data, I look forward to it getting
        | good!
        
       | kennyloginz wrote:
       | Could this just be a case of Reddit being included in the
       | training data?
       | 
       | " I read through codenames official rules to see if using "007"
       | as a clue was allowed, and it turns out it is! To my surprise, I
       | even came across a Reddit post where people were discussing and
       | justifying why this clue fits perfectly within the rules."
        
         | JohnMakin wrote:
         | Yea, initially I thought this post was satire because of this.
        
       | tsroe wrote:
       | Fun quirk about this game: If there aren't too many cards left
       | and your teammate knows their powers of two, you have a winning
       | strategy. You simply lay a mental bitmap over all remaining
       | cards, setting 1 for cards that belong to your team and 0 for all
       | others. You can then just say the number that is represented by
       | this bitmap, e.g. "five" for 0101, and your teammate can decode
        | it in their head. All numbers are, after all, single words.
        | This means that if you are very good at mental maths, or you
        | allow a calculator, you could win every game in the first
        | round. For me personally, however, it only becomes feasible
        | with around 10 cards remaining.
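        | 
        | A minimal Python sketch of the encoding (the card list and
        | team words below are made up):
        | 
        |     def encode_clue(remaining, ours):
        |         # Pack team membership into an integer, one bit per
        |         # card, in reading order.
        |         value = 0
        |         for i, word in enumerate(remaining):
        |             if word in ours:
        |                 value |= 1 << i
        |         return value
        | 
        |     def decode_clue(value, remaining):
        |         # Recover the team's cards from the spoken number.
        |         return [w for i, w in enumerate(remaining)
        |                 if value >> i & 1]
        | 
        |     cards = ["shark", "ring", "sumo", "cherry"]
        |     ours = {"ring", "cherry"}
        |     n = encode_clue(cards, ours)  # 0b1010 == 10, clue "ten"
        |     assert decode_clue(n, cards) == ["ring", "cherry"]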
        
         | RedNifre wrote:
         | That's against the rules.
        
         | Klaster_1 wrote:
         | Guys I was playing with declared a similar move against the
         | rules, so it was back to the old latent space search.
        
           | Smaug123 wrote:
           | It is _explicitly_ against the rules
           | (https://czechgames.com/files/rules/codenames-rules-en.pdf),
           | so they were correct. "Your clue must be about the meaning of
           | the words. You can't use your clue to talk about the letters
           | in a word or its position on the table."
        
         | andrepd wrote:
         | This is explicitly against the rules.
        
         | tweakimp wrote:
         | What if the game showed a different order of cards to every
         | player?
        
           | wccrawford wrote:
           | Because the original was a tabletop game, it can't.
           | 
            | The digital version could and _should_ do this, IMO. (I
            | don't actually know if it does, though, as I've only
            | played the digital version a few times.)
        
       | thrance wrote:
       | I mean, it's playing against itself, not really a fair comparison
       | to humans in my mind. The fun and hard part of this game is to
       | get into your teammates brains and decipher what they possibly
       | meant with what they played.
        
       | xnickb wrote:
        | Somehow I expected the AI to give clues that combine 4, 5, or
        | 6 words at a time. It's not at all impressive to me, and I'm
        | not even a serious player.
        
         | pama wrote:
          | I was wondering the same thing. It is possible that the
          | instructions didn't try to make the gameplay as aggressive
          | as possible. A good model could optimize the separation to
          | make as many words as possible easy to guess, and with
          | access to its own state it should be able to reach 5-6
          | words in most cases. There is an argument for keeping words
          | around that would make it harder for the opponent to guess
          | large, clean separations, so it is possible that optimal
          | play includes simple pairs on occasion. Very interesting
          | application nonetheless.
        
           | vitus wrote:
           | > It is possible that the instructions didn't try to make the
           | gameplay as aggressive as possible.
           | 
           | In case you're wondering, the prompts are available here:
            | https://github.com/SuveenE/codenames-ai/blob/main/utils/prom...
        
         | vitus wrote:
         | I am similarly less-than-impressed. If you click through to the
         | website, you can watch the replay of one of the games mentioned
         | in the article (the one with the clue "invader").
         | 
         | In that instance, the clues all matched 2-3 words, and the
         | winning team got lucky twice (they guessed an unclued word
         | using an unintended correlation, and their opponent guessed a
         | different one of their unclued words.)
         | 
         | You also see a number of instances where the agents continue
         | guessing words for a clue even though they've already gotten
          | enough matches. For instance, in round 2, for the clue
          | "Japan (2)", the blue team guesses sumo and cherry, then
          | goes for a rather tenuous follow-up guess at round 1's
          | "007" with "ring", despite having already gotten both clued
          | matches. A sillier example is in the final round, where the
          | Red Team guesses three words (thereby identifying all nine
          | of their target words), then goes ahead and guesses another.
         | 
         | (For what it's worth, I think "shark" would have been a better
         | guess for another 007 tie-in seeing as there are multiple Bond
         | movies with sharks, but it's also not a match, and again, I
         | wouldn't have gone for a third guess here when there were only
         | two clued words.)
        
           | garretraziel wrote:
           | This is allowed by the rules though. You can guess +1 to the
           | number specified.
        
             | topaz0 wrote:
              | They know it's allowed. It's also a terrible,
              | nonsensical strategy in the specific cases described.
        
         | wwtl12 wrote:
         | The Mechanical Turk is super impressive if you don't know how
         | it works.
        
       | croes wrote:
       | Is that really surprising?
       | 
        | It's basically the same brain playing with itself. It seems
        | quite natural that it links the codenames to the same words.
       | 
       | Let different LLMs play.
        
         | deredede wrote:
          | This is the take I thought I'd have, but in the last
          | example, the guesser model reaches the correct conclusion
          | using different reasoning than the clue giver model.
         | 
         | The clue giver justifies the link of Paper and Log as "written
         | records", and between Paper and Line as "lines of text". But
         | the guesser model connects Paper and Log because "paper is made
         | from logs" (reaching the conclusion through a different meaning
         | of Log), and connects Paper and Line because "'lined paper' is
         | a common type of paper".
         | 
         | Similarly, in the first example, the clue giver connects
         | Monster and Lion because lions are "often depicted as a
         | mythical beast or monster in legends" (a tenuous connection if
         | you ask me), whereas the guesser model thought about King
         | because of King Kong (which I also prefer to Lion).
        
           | unlikelymordant wrote:
            | Generally there is a "temperature" parameter that can be
            | used to add some randomness or variety to an LLM's
            | outputs by changing the likelihood of each next word
            | being selected. This means you could just keep
            | regenerating the same response and get different answers
            | each time, all of them plausible, and all from the same
            | model. This doesn't mean it believes any of them; it just
            | keeps hallucinating likely text, some of which will fit
            | better than others. It is still very much the same brain
            | (or set of trained parameters) playing with itself.
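            | 
            | A simplified sketch of how temperature sampling works
            | (not any particular vendor's implementation):
            | 
            |     import numpy as np
            | 
            |     def sample_next_token(logits, temperature=1.0, rng=None):
            |         # Low T approaches greedy argmax; high T flattens
            |         # the distribution toward uniform.
            |         rng = rng or np.random.default_rng()
            |         scaled = np.asarray(logits, dtype=float) / temperature
            |         scaled -= scaled.max()  # numerical stability
            |         probs = np.exp(scaled)
            |         probs /= probs.sum()
            |         return rng.choice(len(probs), p=probs)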
        
           | wizzwizz4 wrote:
           | > _But the guesser model connects Paper and Log because
           | "paper is made from logs" (reaching the conclusion through a
           | different meaning of Log)_
           | 
           | No, it doesn't. It reaches the conclusion because of vector
           | similarity (simplified explanation): these explanations are
           | _post-hoc_.
        
             | Angostura wrote:
              | Sorry, I'm uninformed. Do you mean that the explanation
              | could be completely unrelated to the _actual_ "reason"?
        
               | DiscourseFan wrote:
               | Yes, the reason is that the model assigns words positions
               | in an ever-changing vector space and evaluates relation
               | by their correspondence in that space--the reply it gives
               | is also a certain index of that space, with the "why" in
               | the question giving it the weight of producing an
               | "answer."
               | 
               | Video series on the topic:
               | https://www.3blue1brown.com/topics/neural-networks
               | 
                | Which is to say that the "why" it gives those answers
                | is that it's statistically likely, within its
                | training data, that when the words "why did you
                | connect line and log with paper" appear, the text
                | which follows could be "logs are made of wood and
                | lines are in paper." But that is not the specific
                | relation of the three words in the model itself,
                | which is just a complex vector space.
        
               | jprete wrote:
               | I definitely think it's doing more than that here (at
               | least inside of the vector-space computations). The model
               | probably directly contains the paper-wood-log
               | association.
        
               | jncfhnb wrote:
               | If an LLM states an answer and then provides a
               | justification for that answer, the justification is
               | entirely irrelevant to the reasoning the bot used. It
               | might be that the semantics of the justification happen
               | to align with the implied logic of the internal vector
                | space, but it is at best a manufactured coincidence.
                | It's no different from you stating an answer and then
               | telling the bot to justify it.
               | 
               | If an LLM is told to do reasoning and then state the
               | answer, it follows that the answer is basically
               | guaranteed to be derived from the previously generated
               | reasoning.
        
               | ActivePattern wrote:
               | The answer will likely match what the reasoning steps
               | bring it to, but that doesn't mean the computations by
               | the LLM to get that answer are necessarily approximated
               | by the outputted reasoning steps. E.g. you might have an
               | LLM that is trained on many examples of Shakespearean
               | text. If you ask it who the author of a given text is, it
                | might give some detailed rationale for why it is
                | Shakespeare, when the real answer is "I have a large
                | prior for Shakespeare".
        
             | lmm wrote:
             | > these explanations are post-hoc.
             | 
             | The best available evidence suggests this is also true of
             | any explanations a human gives for their own behaviour;
             | nevertheless we generally accept those at face value.
        
               | chongli wrote:
               | Of course! If you've played Codenames and introspected on
               | how you play you can see this in action. You pick a few
               | words that feel similar and then try to justify them.
               | Post-hoc rationalization in action.
        
               | topaz0 wrote:
               | Except you also examine the rationalization as part of
               | deciding whether to act on the impulse or not.
        
               | chongli wrote:
               | Yes and you may search for other words that fit the
               | rationalization to decide whether or not it's a good one.
               | You can go even further if your teammates are people you
               | know fairly well by bringing in your own knowledge of
               | these people and how they might interpret the clues.
               | There's a lot of strategy in Codenames and knowledge of
               | vocabulary and related words is only part of it.
        
               | wizzwizz4 wrote:
               | The explanations I give of my behaviour are post-hoc
               | (unless I was paying attention), but I also assess their
               | plausibility by going "if this were the case, how would I
               | behave?" and seeing how well that prediction lines up
               | with my _actual_ behaviour. Over time, I get good at
               | providing explanations that I have no reason to believe
               | are false - which also tend to be explanations that allow
                | other people to predict my behaviour (in ways I
                | didn't anticipate).
               | 
               | GPT-based predictive text systems are _incapable_ of
               | introspection of any kind: they _cannot_ execute the
                | algorithm I execute when I'm giving explanations for my
               | behaviour, nor can they execute _any_ algorithm that
               | might actually result in the explanations becoming or
               | approaching truthfulness.
               | 
               | The GPT model is describing a fictional character named
               | ChatGPT, and telling you why ChatGPT thinks a certain
               | thing. ChatGPT-the-character is _not_ the GPT model. The
               | GPT model has no conception of itself, and cannot ever
               | possibly develop a conception of itself (except through
               | philosophical inquiry, which the system is incapable of
               | for _different_ reasons).
        
             | DominikPeters wrote:
              | This is o1, so it need not be post hoc; it could be the
              | result of reasoning about several possible choices and
              | explanations.
        
         | ushiroda80 wrote:
          | Yeah, not sure what's impressive about this. Having the
          | model be both the guesser and the clue giver will of course
          | give good results, as it's simply a reflection of o1's
          | weighting of tokens.
          | 
          | Interestingly, this could be a way to reverse engineer o1's
          | weightings.
        
         | elicksaur wrote:
          | Or have it play with a human and compare human-human and
          | LLM-human pairs.
        
       | fercircularbuf wrote:
       | I've intuitively felt that this general class of task is what
       | these LLMs are absolutely best at. I'm not an expert on these
       | things, but isn't this thanks to word embeddings and how words
       | are mapped into high dimensional vector space within the model? I
       | would imagine that because every word is mapped this way, finding
       | a word that exists in the same area as mail, lawyer, log, and
       | line in some vector space would be trivial for the model to do,
       | right?
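        | 
        | A toy sketch of that intuition, assuming an embed() function
        | that maps words to unit-length vectors (e.g. from word2vec or
        | GloVe) and a placeholder vocabulary:
        | 
        |     import numpy as np
        | 
        |     def nearest_clue(targets, vocab, embed):
        |         # Average the target vectors and return the vocabulary
        |         # word closest (by cosine) to that centroid.
        |         centroid = np.mean([embed(w) for w in targets], axis=0)
        |         centroid /= np.linalg.norm(centroid)
        |         candidates = [w for w in vocab if w not in targets]
        |         return max(candidates,
        |                    key=lambda w: float(embed(w) @ centroid))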
        
         | infinitifall wrote:
         | More than just words. I've found LLMs immensely helpful for
         | searching through the latent space or essence of
          | quotes/books/movies/memes. I can ask things like "what's
          | that book/movie set in X where Y happens" or "what's that
          | quote by P which goes something like Q" in my own
          | paraphrased way and, with a little prodding, get the
          | answer. You'd have no luck
         | with traditional search engines unless someone has previously
         | asked a similar question.
        
       | captn3m0 wrote:
        | I've been trying to do this with just word2vec instead of
        | throwing an LLM at it, since you just need to find a word
        | with the appropriate distances optimized.
       | https://github.com/captn3m0/ideas?tab=readme-ov-file#codenam...
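        | 
        | One way to phrase that optimization (a sketch; the embed()
        | function returning unit numpy vectors and the candidate word
        | list are assumptions):
        | 
        |     def score_clue(clue, ours, avoid, embed):
        |         # Reward the weakest link to our words; penalize the
        |         # strongest link to words we must not hint at.
        |         c = embed(clue)
        |         pull = min(float(c @ embed(w)) for w in ours)
        |         push = max(float(c @ embed(w)) for w in avoid)
        |         return pull - push
        | 
        |     def best_clue(candidates, ours, avoid, embed):
        |         legal = [w for w in candidates
        |                  if w not in ours and w not in avoid]
        |         return max(legal,
        |                    key=lambda w: score_clue(w, ours, avoid, embed))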
        
         | dartos wrote:
         | I love this.
         | 
         | Imagine the energy savings if more people didn't just
         | automatically reach for LLMs for their pet projects.
        
         | zeroonetwothree wrote:
         | I tried this many years ago (before LLMs) with hundreds of real
         | human games and it was never that good.
        
         | qqqult wrote:
          | I did that last summer. I compared the performance of
          | different English word embedding models; as far as I
          | remember, the best ones were GloVe and a few knowledge
          | graph word embeddings.
          | 
          | None of them were better than a human at giving hints for
          | 3+ words, though.
        
       | tweakimp wrote:
       | It would be really interesting to see an LLM watch other players
       | and learn how they think to find the best clues THEY need to hear
       | to find the right words.
        
       | progrus wrote:
       | GPT-3 was superhuman at this too
        
       | sylware wrote:
        | If it can port C++ to C99+ and write correct 64-bit RISC-V
        | assembly...
        
       | jprete wrote:
       | Codenames is absolutely dead-center of what I expect _Large
       | Language Models_ to be good at. The fundamental skills of the
       | game are: having an excellent embedding for word semantics and
        | connotations; modeling other people's embeddings; and a
        | little bit of game strategy related to its competitive
        | nature.
        
       | badgersnake wrote:
       | Or just play with your friends?
        
       | zeroonetwothree wrote:
       | I don't find this "super good". It's mostly giving 2 clues which
       | is the most basic level of competence. The paper 4 clue is
       | reasonable but a bit lucky (eg Jack is also a good guess). I also
       | don't see it actually using tactics properly, which I would
       | consider part of being "super good". The game isn't just about
       | picking a good clue each round!
       | 
       | Now obviously it's still pretty decent at finding the clues.
        | Probably better than a random human who hasn't played much.
        | I just find the post's level of hype overstated. It feels
        | like the
       | author isn't very experienced with Codenames.
       | 
       | It would be interesting to compare AI:human vs human:human games
       | to see which does better. It seems like AI:AI will overstate its
       | success.
        
         | groggo wrote:
         | Can you elaborate on some of the more advanced tactics?
         | 
         | When I play, it's mostly about getting a good 2 clue each time.
         | Then if you can opportunistically get a 3 or 4, that's awesome.
         | 
          | Some tactics come in when choosing the right pairs of 2s so
          | you don't end up mismatched, or avoiding clues that might
          | be ambiguous with your opponent's words... but that's
          | mostly it.
          | 
          | It'll be fun for multiplayer! Just like in other online
          | games, you could add an AI to play as one of the players.
        
           | mtmickush wrote:
            | Other advanced tactics involve giving a broad clue that
            | matches 3-4 of your own words and just one other (either
            | your opponent's or a civilian). Your team can pick up all the
           | matches across several turns and the one off doesn't hurt as
           | much as the plus four helps
        
             | hunter2_ wrote:
              | The S-tier tactic: when a high-number clue is cut short
              | by a turn-ending mistake, the guessers tell their clue
              | giver how many cards from the truncated turn they had
              | already located (and which it would therefore be
              | wasteful for a future clue to re-group). On the totally
              | unrelated next clue, the clue giver inflates the stated
              | number by that amount, so the number allows for the new
              | clue's own cards plus the prior ones.
             | 
             | Example: The clue is "places 4" and the guessers choose 1
             | correctly and then 1 wrong answer, but they had achieved
             | consensus about 2 others (and are confused about only the
              | remaining 1). So the turn ends, but they inform the clue
             | giver to inflate by 2 next turn. That clue giver (after the
             | other team goes) will then say the clue is "people 5" and
             | the guessers will know that they shall select 2 places and
             | 3 people.
             | 
             | This can cascade beyond just a pair of turns.
        
               | ruds wrote:
               | I don't think this sort of communication from guessers to
               | clue giver is in the spirit of the game (at least in my
               | play group). However, inflating later clues is a
               | reasonable approach! It's just that I don't think you're
               | allowed to communicate the amount of inflation. Guessers
               | must determine whether people 5 has slack to allow
               | additional guesses on previous clues.
        
               | hunter2_ wrote:
               | You're free to add additional prohibitions on
               | communication as a house rule I guess, but the only
               | prohibition in the rule book I've seen is that the clue
               | giver's speech must consist exclusively of clues (and
               | private consultation with the other clue giver). The clue
               | giver is free to adjust their clue in reaction to
               | anything they hear, and guessers can speak freely.
               | 
               | Important: the clue giver cannot acknowledge the
               | instruction during gameplay. That would certainly extend
               | beyond giving a clue! The guessers must know that their
               | clue giver can play this way prior to the game
               | commencing.
               | 
               | Edit: I just consulted the rules and this is the most
               | relevant section:
               | 
               | > If you are a field operative, you should focus on the
               | table when you are making your guesses. Do not make eye
               | contact with the spymaster while you are guessing. This
               | will help you avoid nonverbal cues.
               | 
               | > When your information is strictly limited to what can
               | be conveyed with one word and one number, you are playing
               | in the spirit of the game.
               | 
               | The author's use of the pronoun "you/your" switches from
               | field ops in that first paragraph to spymasters in that
               | second paragraph, confusingly. With that in mind, it
               | boils down to this: field ops cannot seek non-clue
               | information from spymasters, and spymasters cannot convey
               | non-clue information. The strategy I'm suggesting
               | involves neither!
        
               | ALittleLight wrote:
                | If you take this idea of communication restrictions
                | to the limit, you could imagine the guessers naming N
                | candidate sets of cards, one word each, as they
                | discuss their guess. The clue giver listens, then
                | gives the word that identifies the correct set.
               | 
                | You really just need an algorithm that generates
                | unique sets of 8 or 9 cards from the whole board and
                | identifies each set by a word.
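                | 
                | The set-indexing half is standard combinatorics; a
                | sketch using the combinatorial number system (tying
                | ranks to actual clue words is left out):
                | 
                |     from math import comb
                | 
                |     def rank_subset(positions):
                |         # Map sorted card indices (0-24) to a unique
                |         # rank among all subsets of that size.
                |         return sum(comb(p, i + 1)
                |                    for i, p in enumerate(sorted(positions)))
                | 
                |     total = comb(25, 8)  # 1,081,575 possible 8-card sets
                |     r = rank_subset((0, 3, 5, 7, 11, 13, 17, 23))
                |     assert 0 <= r < total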
        
               | groggo wrote:
                | Yeah, it's interesting to take these ideas to the
                | extreme... even at the lower end I don't like it; I think
                | zero communication outside of clues is the best way to
               | follow the spirit of the game. But a little bit of banter
               | and "kibitzing" is what makes it fun too.
        
               | ta_1138 wrote:
                | The communication is only necessary/important if
                | people haven't set this as a convention in the first
                | place. I'll say this prior to ever looking at my
                | clues: "I will give you higher numbers than what I
                | said if you miss by more than 1. The number I pick
                | will always be high enough to allow you, with the +1
                | guess you get for free, to make guesses on all the
                | words I was hinting at."
               | 
                | There's also all kinds of not-necessarily-intended
                | communication from the guessers, in that you can
                | listen to which words they were considering and
                | didn't pick. Nothing in the game attempts to say that
                | you should not consider, say, whether they were going
                | in the right or wrong direction in their guessing,
                | but it sure can make a difference in how to approach
                | later clues. If they were being very wrong, there
                | might be a need to double up on words that you
                | intended and that your guessers missed.
               | 
                | In the same fashion, nothing in the game says that I
                | cannot listen to those guesses as a member of the
                | other team, whether guesser or spymaster, and then
                | change behavior to make sure we don't hit words they
                | considered as candidates without very good reason.
                | Let them double dip on mistakes, or not make their
                | difficult decisions easier. It's not as if the game
                | demands that everyone who isn't currently guessing
                | wear headphones to be sure they disregard what the
                | other team says or does.
        
               | foota wrote:
               | You can of course play however you want (and I certainly
               | think this is clever), but imo this is likely against the
               | spirit, and perhaps letter, of the rules.
               | 
               | The rule on giving clues is:
               | 
               | "If you are the spymaster, you are trying to think of a
               | one-word clue that relates to some of the words your team
               | is trying to guess. When you think you have a good clue,
               | you say it. You also say one number, _which tells your
               | teammates how many codenames are related to your clue_. "
               | (emphasis mine).
               | 
               | The rule states that the number should be the number of
                | words related to the clue. There are later provisions
                | _allowing_ you to use zero and infinity, but outside
                | of these carve-outs (and imo the "allowed" language
                | is telling here, since it implies any other number
                | not equal to the number of words is not allowed) I
                | don't think this is legal.
        
             | lostlogin wrote:
             | > the one off doesn't hurt as much as the plus four helps
             | 
                | Doesn't the turn end if you hit the opponent's word?
        
               | topaz0 wrote:
               | Yes, but they can go back for those words in future
               | rounds
        
           | jncfhnb wrote:
           | If you really want to get good, your goal is not so much to
           | get as many tiles as possible, but rather to get the tiles
           | that are semantically distinct from your opponent's. A single
           | mistake that triggers your opponent's tile is generally
            | enough to lose the game. And even if they don't, having
            | them uncover the tiles from their side that are
            | semantically similar to your own team's is also useful.
           | 
           | If you want to get nasty, you learn to abuse the fact that
           | the tile layouts follow rules and that you can rule out
           | certain tiles without considering the words.
        
             | groggo wrote:
             | Memorizing the tile layouts is too much for me haha (imo
             | against the spirit of the game). I usually play online now
             | anyway so I hope they don't follow those same patterns as
             | the physical version.
        
             | bentcorner wrote:
             | > you learn to abuse the fact that the tile layouts follow
             | rules and that you can rule out certain tiles without
             | considering the words.
             | 
             | Can you clarify? Isn't the card placement random?
        
               | jncfhnb wrote:
               | It's randomish. There are facts about the possible
               | layouts you can memorize.
               | 
                | I don't know most of these rules, but there are never
                | 5 in a row, or even 4 in a row if you're the team
                | with one fewer (the second team to play).
               | 
               | Edit: because the game layout is determined by choosing
               | one of a few dozen possible layout cards and randomly
               | rotating it
        
               | tveita wrote:
               | There are 40 setup cards with 4 possible rotations that
               | specify agent placements, so it's theoretically possible
               | to do some kind of memorization.
               | 
               | Personally I'd find that kind of play style very unfun,
               | and would rather switch to fully randomized boards if I
               | played enough that it became a problem.
               | 
               | https://danluu.com/codenames/
        
           | harrall wrote:
           | I find the game more about reading the people on your team
           | (and the other team) to understand how they think.
           | 
           | You have to give entirely different clues depending on the
           | people you play with.
           | 
            | Sometimes you can also play adversarially and introduce
            | doubt
           | into the opposing team by giving topic-adjacent clues that
           | cause them to avoid one of their own cards. It works better
           | if someone on the other team tends to be a big doubter. It
           | also can work when the other team constantly goes back and
           | tries to pick n+1 cards that they think they missed from the
           | last round, which gives you a lot of room to psychologically
           | mess with them.
           | 
            | Sometimes you have a clue that only really matches 2, but
            | because the only wrong match is a neutral card and you
            | could match 2 more by a massive stretch, you say "4."
            | Worst case, they get 2 right and then pick the neutral
            | card; best case, you stand to gain 4 from a clue that
            | should only match 2.
           | 
            | I like Codenames because there are many meta ways to play
            | the game. What makes Codenames unique is that, unlike a lot of
           | other games (Catan, Secret Hitler, CAH, etc.), it's an
           | adversarial team game where the team dynamics and discussions
           | are not secret so you can use them to your advantage.
        
           | blix wrote:
            | Experienced players who know their teammates well can
            | reliably get 3-4s. If you only go for safe 2s against
            | these opponents, you will lose every time.
        
         | dang wrote:
         | Ok, we've taken supergoodness out of the title now. Presumably
         | the post is still interesting!
         | 
         | (Submitted title was "I got OpenAI o1 to play the boardgame
         | Codenames and it's super good".)
        
       | jsemrau wrote:
        | I have been doing some experiments with agents using
        | reinforcement learning to play a 4x4 Tic Tac Toe game [1].
        | Given my analysis of the "thought" process, we are still
        | really far from true understanding of such games. While in my
        | game, as well as the OP's, the rules are pre-trained and the
        | models are good enough to reach a conclusion (which in itself
        | is already impressive), there is still a long way to go.
       | 
       | [1] https://jdsemrau.substack.com/p/nemotron-vs-qwen-game-
       | theory...
        
       | lolinder wrote:
       | A small weakness in this test is that one of the keys to
       | strategic Codenames play is understanding your partner. You're
       | not just trying to connect the words, you're trying to connect
       | them in a way that will be obvious to your partner. As a
       | computing analogy: you're trying to serialize a few cards in a
       | way that will be deserializable by the other player.
       | 
       | This test pairs o1 with itself, which means the serializer _is_
        | the deserializer. So while it's impressive that it can link 4
       | words, most humans could also easily link 4 with as much
       | stretching! We just don't tend to because we can't guarantee that
       | the other human will make the same connections we did.
        
         | ModernMech wrote:
         | lol I played this game with my family and they said my wife and
          | I were cheating because I kept using inside jokes that made
          | no sense to them but that she would get immediately.
        
           | dgritsko wrote:
           | That's a big part of what makes this game enjoyable - a clue
           | that is very obvious to one person might not even cross the
           | mind of someone else. To anyone reading this who hasn't
           | played, it's definitely worth giving it a try.
        
             | slyn wrote:
             | Agreed, big fan of codenames in general but it plays its
             | best when you're playing against / alongside people that
              | you've known for a while. The metagaming aspect of
              | structuring clues around who your partner is really
              | takes it to the next level.
        
         | jncfhnb wrote:
         | Ehhh I don't think that's accurate. The problem is not linking
         | 4 words. It's linking 4 words without accidentally triggering
         | other, semantically adjacent words.
         | 
          | This task could probably be solved nearly as well with
          | old-school word2vec embeddings.
        
           | lolinder wrote:
           | Right, that's what I meant to be getting at: when you connect
           | 4 words with as much stretching as o1 did there, you're
           | running a real risk that the other party connects a different
           | set. Unless that other party is also you and has the same
           | learned connections at top of mind.
        
           | furyofantares wrote:
            | > This task could probably be solved nearly as well with
            | old-school word2vec embeddings
           | 
           | I've tried. This approach is well beyond awful.
        
       | jerkstate wrote:
       | You can pretty reliably get 2-clues and sometimes good 3-clues
       | just using word2vec embedding similarity
        
       | simonw wrote:
       | I've been trying out various "reasoning" models (o1, R1, Gemini
       | Thinking etc) against the NYT Connections word puzzle - it's a
       | really interesting test of them. So far o1 Pro has been the most
       | consistently successful:
       | https://www.nytimes.com/games/connections
        
         | topaz0 wrote:
          | Wonder if they use LLMs to write those puzzles.
        
       | macromaniac wrote:
        | I made one a few years back where you play with the AI
        | instead of AI vs. AI, but never posted it anywhere. If anyone
        | wants to try, I just updated it to gpt-4o-mini:
        | https://wordswithrobots.isotropic.us/
        
         | blakeburch wrote:
          | Love the idea! Just wish you could specify a number like
          | you do in Codenames. Otherwise, it just keeps going until
          | all of its options are wrong.
        
           | macromaniac wrote:
            | True, because then it feels more intentional (+ the extra
            | strategy). It was definitely a bit thrown together - atm
            | I only ever use it when I need a bit of practice before
            | playing Codenames.
        
       | Amekedl wrote:
       | "o1 is more knowledgeable than the average human"
       | 
       | "the toyota yaris can move faster than the average human"
       | 
        | Even OPT-125M from years ago can pull more facts than the
        | average human.
        
       | bongodongobob wrote:
       | I played it with 3.5 and it was great. This isn't something o1
       | just picked up on.
        
       | lsy wrote:
       | Some of these clues wouldn't be very good for a human playing.
       | "007" for example isn't a very good clue for "laser", not only
       | because something _happening to be_ in one of several films about
       | a character doesn 't rise to the typical level of salience, but
       | also because other words on-board like "shark" and "astronaut"
       | even moreso meet the criterion of featuring prominently in James
       | Bond movies, and "astronaut" appears to be a game-ending choice.
        
       | yantrams wrote:
       | I cracked myself up with a ridiculous train of thought for fun
       | while playing Codenames once. It went a little something like
       | this
       | 
       | Star => Twinkle => Twinkle Khanna => Married to Akshay Kumar =>
       | Canadian Citizen => Maple Syrup ( Leaf ? )
        
       | lynguist wrote:
        | I kinda have the same very subjective feeling that o1 is the
        | first AI that is clearly superior to me.
        
       | some_random wrote:
        | I don't find this remotely compelling. I can easily come up
        | with clues that make sense to me and connect a ton of words;
        | the difficulty is coming up with clues that others will look
        | at the same way. The last example is exactly what I mean:
        | "paper" makes sense for those 4 only when you explain it. If
        | "Line" counts, then why not "Gum" (which is typically wrapped
        | in paper)? Or if "Lawyer" is valid, then why not "King"
        | (whose decrees are written on what?)?
        
       | jinyang0220 wrote:
       | Dude I looooooooooved that game. How long did u spend building
       | it?
        
       ___________________________________________________________________
       (page generated 2025-01-25 23:00 UTC)