[HN Gopher] OpenAI's o1 Playing Codenames
___________________________________________________________________
OpenAI's o1 Playing Codenames
Author : suveen_ellawela
Score : 195 points
Date : 2025-01-22 06:21 UTC (3 days ago)
(HTM) web link (suveenellawela.com)
(TXT) w3m dump (suveenellawela.com)
| JaggerFoo wrote:
| I did this with Claude over the holidays, putting Claude in the
| role of guesser and comparing its guesses to those of an
| experienced human player. It turns out they matched each other.
| suveen_ellawela wrote:
| That's a nice experiment! I think Codenames could definitely be
| an evaluation method for LLMs.
| pieix wrote:
| Elo on different card games/board games would be a great eval
| metric now that the systems are general enough to play
| Codenames, chess, poker...
| __MatrixMan__ wrote:
| It would be fun to build one, perhaps mediated by an app,
| where you have to guess whether your spymaster is a human or
| an AI based on the quality of their choices.
| zeroonetwothree wrote:
| The average human is quite bad. It really works well when
| the spymaster is (a) experienced and (b) familiar with the
| other players.
| __MatrixMan__ wrote:
| It's the (b) case I'm interested in. Like the spymaster
| loses if they can't subtly indicate to their friends that
| they're the real deal. Otherwise the robots win.
| joaomacp wrote:
| I tried whatever the multi-modal paid ChatGPT model is on the
| Codenames Pictures version, and it didn't fare that well. Since
| they will probably scrape this comment and add it to next model's
| training data, I look forward to it getting good!
| kennyloginz wrote:
| Could this just be a case of Reddit being included in the
| training data?
|
| " I read through codenames official rules to see if using "007"
| as a clue was allowed, and it turns out it is! To my surprise, I
| even came across a Reddit post where people were discussing and
| justifying why this clue fits perfectly within the rules."
| JohnMakin wrote:
| Yea, initially I thought this post was satire because of this.
| tsroe wrote:
| Fun quirk about this game: If there aren't too many cards left
| and your teammate knows their powers of two, you have a winning
| strategy. You simply lay a mental bitmap over all remaining
| cards, setting 1 for cards that belong to your team and 0 for all
| others. You can then just say the number that is represented by
| this bitmap, e.g. "five" for 0101, and your teammate can decode
| it in their head. All numbers are, after all, single words. This
| means, if you are very good at mental maths or you allow for a
| calculator, you could even win every game in the first round. For
| me personally, however, it only becomes feasible with around 10
| cards remaining.
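|
| A minimal sketch of the encode/decode step (Python; the card
| ordering convention is an assumption -- both players must agree
| on reading the board, say, left to right, top to bottom):
|
|     def encode(remaining, ours):
|         # Spymaster: set bit i if remaining card i is ours.
|         n = 0
|         for i, card in enumerate(remaining):
|             if card in ours:
|                 n |= 1 << i
|         return n  # say this number aloud as the clue
|
|     def decode(clue_number, remaining):
|         # Guesser: recover which cards to pick from the number.
|         return [card for i, card in enumerate(remaining)
|                 if clue_number & (1 << i)]
|
|     remaining = ["apple", "bridge", "laser", "ring"]  # toy board
|     clue = encode(remaining, {"bridge", "ring"})  # -> 10 ("ten")
|     assert decode(clue, remaining) == ["bridge", "ring"]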
| RedNifre wrote:
| That's against the rules.
| Klaster_1 wrote:
| The folks I was playing with declared a similar move against the
| rules, so it was back to the old latent space search.
| Smaug123 wrote:
| It is _explicitly_ against the rules
| (https://czechgames.com/files/rules/codenames-rules-en.pdf),
| so they were correct. "Your clue must be about the meaning of
| the words. You can't use your clue to talk about the letters
| in a word or its position on the table."
| andrepd wrote:
| This is explicitly against the rules.
| tweakimp wrote:
| What if the game showed a different order of cards to every
| player?
| wccrawford wrote:
| Because the original was a tabletop game, it can't.
|
| The digital version could and _should_ do this, IMO. (I don't
| actually know if it does, though, as I've only played the
| digital version a few times.)
| thrance wrote:
| I mean, it's playing against itself, not really a fair comparison
| to humans in my mind. The fun and hard part of this game is to
| get into your teammates brains and decipher what they possibly
| meant with what they played.
| xnickb wrote:
| Somehow I expected AI to give clues that combine 4-5-6 words at a
| time. It's not at all impressive to me. And I'm not a serious
| player at all
| pama wrote:
| I was wondering the same. It is possible that the
| instructions didn't try to make the gameplay as aggressive as
| possible. A good model could optimize the separation to make it
| easy to guess as many words as possible. By having access to its
| own state, it should be possible to reach 5-6 words in most
| cases. There is an argument for keeping words around that would
| make it harder for the opponents to guess large/clean
| separations, so it is possible that optimal play includes
| simple pairs on occasion. Very interesting application
| nonetheless.
| vitus wrote:
| > It is possible that the instructions didn't try to make the
| gameplay as aggressive as possible.
|
| In case you're wondering, the prompts are available here:
| https://github.com/SuveenE/codenames-ai/blob/main/utils/prom...
| vitus wrote:
| I am similarly less-than-impressed. If you click through to the
| website, you can watch the replay of one of the games mentioned
| in the article (the one with the clue "invader").
|
| In that instance, the clues all matched 2-3 words, and the
| winning team got lucky twice (they guessed an unclued word
| using an unintended correlation, and their opponent guessed a
| different one of their unclued words).
|
| You also see a number of instances where the agents continue
| guessing words for a clue even though they've already gotten
| enough matches. For instance, in round 2, for the clue "Japan
| (2)", the blue team guesses sumo and cherry, then goes for a
| rather tenuous followup guess for round 1's 007 with "ring"
| (despite having gotten the two clued matches in the first
| round). A sillier example is in the final round, where the Red
| Team makes three correct guesses (thereby identifying all nine
| of their target words), then goes ahead and guesses another word.
|
| (For what it's worth, I think "shark" would have been a better
| guess for another 007 tie-in seeing as there are multiple Bond
| movies with sharks, but it's also not a match, and again, I
| wouldn't have gone for a third guess here when there were only
| two clued words.)
| garretraziel wrote:
| This is allowed by the rules though. You can guess +1 to the
| number specified.
| topaz0 wrote:
| They know it's allowed. It's also a terrible, nonsensical
| strategy in the specific cases that are described.
| wwtl12 wrote:
| The Mechanical Turk is super impressive if you don't know how
| it works.
| croes wrote:
| Is that really surprising?
|
| It's basically the same brain playing with itself. Seems quite
| natural to link the code names to the same words.
|
| Let different LLMs play.
| deredede wrote:
| This is the take I thought I'd have, but in the last example,
| the guesser model reaches the correct conclusion via different
| reasoning than the clue giver model.
|
| The clue giver justifies the link of Paper and Log as "written
| records", and between Paper and Line as "lines of text". But
| the guesser model connects Paper and Log because "paper is made
| from logs" (reaching the conclusion through a different meaning
| of Log), and connects Paper and Line because "'lined paper' is
| a common type of paper".
|
| Similarly, in the first example, the clue giver connects
| Monster and Lion because lions are "often depicted as a
| mythical beast or monster in legends" (a tenuous connection if
| you ask me), whereas the guesser model thought about King
| because of King Kong (which I also prefer to Lion).
| unlikelymordant wrote:
| generally there is a "temperature" parameter that can be used
| to add some randomness or variety to the LLMs outputs by
| changing the likelihood of the next word being selected. This
| means you could just keep regenerating the same response and
| get different answers each time. each time it will give
| different plausible responses, and this is all from the same
| model. This doesn't mean it believes any of them, it just
| keeps hallucinating likely text, some of which will fit
| better than others. It is still very much the same brain (or
| set of trained parameters) playing with itself.
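|
| A minimal sketch of the mechanism (Python; the logit values are
| invented toy numbers, not real model outputs):
|
|     import math, random
|
|     def sample_token(logits, temperature=1.0):
|         # Divide logits by T before softmax: T < 1 sharpens the
|         # distribution, T > 1 flattens it toward uniform.
|         scaled = [x / temperature for x in logits]
|         m = max(scaled)
|         exps = [math.exp(s - m) for s in scaled]
|         probs = [e / sum(exps) for e in exps]
|         return random.choices(range(len(logits)), weights=probs)[0]
|
|     # Toy next-token logits for ["monster", "king", "lion"]:
|     logits = [2.0, 1.5, 0.5]
|     # At T=0.2 "monster" dominates; at T=1.5 the other options
|     # come up often, so reruns give varied clues.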
| wizzwizz4 wrote:
| > _But the guesser model connects Paper and Log because
| "paper is made from logs" (reaching the conclusion through a
| different meaning of Log)_
|
| No, it doesn't. It reaches the conclusion because of vector
| similarity (simplified explanation): these explanations are
| _post-hoc_.
| Angostura wrote:
| Sorry, I'm uninformed. Do you mean that the explanation
| could be completely unrelated to the _actual_ "reason"?
| DiscourseFan wrote:
| Yes, the reason is that the model assigns words positions
| in an ever-changing vector space and evaluates relatedness
| by their correspondence in that space--the reply it gives
| is likewise drawn from that space, with the "why" in the
| question merely weighting it toward producing an "answer."
|
| Video series on the topic:
| https://www.3blue1brown.com/topics/neural-networks
|
| Which is to say that "why" it gives those answers is
| because its statistically likely within its training data
| that when there are the words, "why did you connect line
| and log with paper" the text which follows could be "logs
| are made of wood and lines are in paper." But that is not
| the specific relation of the 3 words in the model itself,
| which is just a complex vector space.
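|
| A minimal sketch of the "correspondence in vector space" idea
| (Python; the 3-d vectors are invented toy values -- real
| embeddings have hundreds of dimensions):
|
|     import math
|
|     def cosine(a, b):
|         # Relatedness as the angle between two vectors.
|         dot = sum(x * y for x, y in zip(a, b))
|         na = math.sqrt(sum(x * x for x in a))
|         nb = math.sqrt(sum(y * y for y in b))
|         return dot / (na * nb)
|
|     # Toy embeddings: relatedness is just geometric closeness.
|     emb = {
|         "paper": [0.9, 0.1, 0.3],
|         "log":   [0.8, 0.2, 0.4],
|         "shark": [0.1, 0.9, 0.2],
|     }
|     # "paper" sits near "log" and far from "shark", with no
|     # stored reason *why* -- any verbal explanation comes later.
|     print(cosine(emb["paper"], emb["log"]),
|           cosine(emb["paper"], emb["shark"]))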
| jprete wrote:
| I definitely think it's doing more than that here (at
| least inside of the vector-space computations). The model
| probably directly contains the paper-wood-log
| association.
| jncfhnb wrote:
| If an LLM states an answer and then provides a
| justification for that answer, the justification is
| entirely irrelevant to the reasoning the bot used. It
| might be that the semantics of the justification happen
| to align with the implied logic of the internal vector
| space, but that is at best a manufactured coincidence.
| It's no different from you stating an answer and then
| telling the bot to justify it.
|
| If an LLM is told to do reasoning and then state the
| answer, it follows that the answer is basically
| guaranteed to be derived from the previously generated
| reasoning.
| ActivePattern wrote:
| The answer will likely match what the reasoning steps
| bring it to, but that doesn't mean the computations by
| the LLM to get that answer are necessarily approximated
| by the outputted reasoning steps. E.g. you might have an
| LLM that is trained on many examples of Shakespearean
| text. If you ask it who the author of a given text is, it
| might give some more detailed rationale for why it is
| Shakespeare, when the real answer is "I have a large prior
| for Shakespeare".
| lmm wrote:
| > these explanations are post-hoc.
|
| The best available evidence suggests this is also true of
| any explanations a human gives for their own behaviour;
| nevertheless we generally accept those at face value.
| chongli wrote:
| Of course! If you've played Codenames and introspected on
| how you play you can see this in action. You pick a few
| words that feel similar and then try to justify them.
| Post-hoc rationalization in action.
| topaz0 wrote:
| Except you also examine the rationalization as part of
| deciding whether to act on the impulse or not.
| chongli wrote:
| Yes and you may search for other words that fit the
| rationalization to decide whether or not it's a good one.
| You can go even further if your teammates are people you
| know fairly well by bringing in your own knowledge of
| these people and how they might interpret the clues.
| There's a lot of strategy in Codenames and knowledge of
| vocabulary and related words is only part of it.
| wizzwizz4 wrote:
| The explanations I give of my behaviour are post-hoc
| (unless I was paying attention), but I also assess their
| plausibility by going "if this were the case, how would I
| behave?" and seeing how well that prediction lines up
| with my _actual_ behaviour. Over time, I get good at
| providing explanations that I have no reason to believe
| are false - which also tend to be explanations that allow
| other people to predict my behaviour (in ways I didn't
| anticipate).
|
| GPT-based predictive text systems are _incapable_ of
| introspection of any kind: they _cannot_ execute the
| algorithm I execute when I'm giving explanations for my
| behaviour, nor can they execute _any_ algorithm that
| might actually result in the explanations becoming or
| approaching truthfulness.
|
| The GPT model is describing a fictional character named
| ChatGPT, and telling you why ChatGPT thinks a certain
| thing. ChatGPT-the-character is _not_ the GPT model. The
| GPT model has no conception of itself, and cannot ever
| possibly develop a conception of itself (except through
| philosophical inquiry, which the system is incapable of
| for _different_ reasons).
| DominikPeters wrote:
| This is o1, so it need not be post hoc but rather the result
| of reasoning about several possible choices and explanations.
| ushiroda80 wrote:
| Yeah, not sure what's impressive about this. Having the model be
| both the guesser and clue giver will of course produce good
| results, as it's simply a reflection of o1's weighting of
| tokens.
|
| Interestingly, this could be a way to reverse engineer o1's
| weightings.
| elicksaur wrote:
| Or, have it play a human and compare human-human and llm-human
| pairs.
| fercircularbuf wrote:
| I've intuitively felt that this general class of task is what
| these LLMs are absolutely best at. I'm not an expert on these
| things, but isn't this thanks to word embeddings and how words
| are mapped into high dimensional vector space within the model? I
| would imagine that because every word is mapped this way, finding
| a word that exists in the same area as mail, lawyer, log, and
| line in some vector space would be trivial for the model to do,
| right?
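|
| A minimal sketch of that intuition (Python with the gensim
| word2vec API; the pretrained vectors file name is an assumption
| -- any word2vec-format file works):
|
|     from gensim.models import KeyedVectors
|
|     # Load pretrained embeddings (file name is an example).
|     model = KeyedVectors.load_word2vec_format(
|         "GoogleNews-vectors-negative300.bin", binary=True)
|
|     # Words near the combined direction of all four targets:
|     for word, score in model.most_similar(
|             positive=["mail", "lawyer", "log", "line"], topn=10):
|         print(word, round(score, 3))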
| infinitifall wrote:
| More than just words. I've found LLMs immensely helpful for
| searching through the latent space or essence of
| quotes/books/movies/memes. I can ask things like "what's that
| book/movie set in X where Y happens" or "what's that quote by
| P which goes something like Q" in my own paraphrased way and,
| with a little prodding, expect the answer. You'd have no luck
| with traditional search engines unless someone has previously
| asked a similar question.
| captn3m0 wrote:
| I've been trying to do this with just word2vec, instead of
| throwing an LLM at it, since you just need to find a word with
| the appropriate distances optimized.
| https://github.com/captn3m0/ideas?tab=readme-ov-file#codenam...
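|
| One plausible scoring rule for that optimization (Python,
| assuming a gensim KeyedVectors `model` as in the sketch
| upthread; maximizing the weakest link to your words while
| penalizing closeness to the other team's is one choice of
| objective, not captn3m0's actual implementation):
|
|     def best_clue(model, ours, theirs, vocab_limit=20000):
|         # Board words assumed lowercase and in the vocabulary.
|         best, best_score = None, float("-inf")
|         for clue in model.index_to_key[:vocab_limit]:
|             low = clue.lower()
|             # Rules: a clue may not contain a word on the board.
|             if any(w in low or low in w for w in ours + theirs):
|                 continue
|             score = (min(model.similarity(clue, w) for w in ours)
|                      - max(model.similarity(clue, w)
|                            for w in theirs))
|             if score > best_score:
|                 best, best_score = clue, score
|         return best, best_score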
| dartos wrote:
| I love this.
|
| Imagine the energy savings if more people didn't just
| automatically reach for LLMs for their pet projects.
| zeroonetwothree wrote:
| I tried this many years ago (before LLMs) with hundreds of real
| human games and it was never that good.
| qqqult wrote:
| I did that last summer. I compared the performance of different
| English word embedding models; as far as I remember, the best
| ones were GloVe and a few knowledge graph word embeddings.
|
| None of them were better than a human at giving hints for 3+
| words though.
| tweakimp wrote:
| It would be really interesting to see an LLM watch other players
| and learn how they think to find the best clues THEY need to hear
| to find the right words.
| progrus wrote:
| GPT-3 was superhuman at this too
| sylware wrote:
| If it can port c++ to C99+ and write correct 64bit risc-v
| assembly...
| jprete wrote:
| Codenames is absolutely dead-center of what I expect _Large
| Language Models_ to be good at. The fundamental skills of the
| game are: having an excellent embedding for word semantics and
| connotations; modeling other people's embeddings; a little bit
| of game strategy related to its competitive nature.
| badgersnake wrote:
| Or just play with your friends?
| zeroonetwothree wrote:
| I don't find this "super good". It's mostly giving 2 clues which
| is the most basic level of competence. The paper 4 clue is
| reasonable but a bit lucky (eg Jack is also a good guess). I also
| don't see it actually using tactics properly, which I would
| consider part of being "super good". The game isn't just about
| picking a good clue each round!
|
| Now obviously it's still pretty decent at finding the clues.
| Probably better than a random human who hasn't played much. Just
| I find the post's level of hype overstated. It feels like the
| author isn't very experienced with Codenames.
|
| It would be interesting to compare AI:human vs human:human games
| to see which does better. It seems like AI:AI will overstate its
| success.
| groggo wrote:
| Can you elaborate on some of the more advanced tactics?
|
| When I play, it's mostly about getting a good 2 clue each time.
| Then if you can opportunistically get a 3 or 4, that's awesome.
|
| Some tactics come in for choosing the right pairs of 2's so you
| don't end up mismatched, or leaving clues that might be
| ambiguous with your opponent's... But that's mostly it.
|
| It'll be fun for multiplayer! Just like how in other online
| games you can add in an AI to play as one of the players.
| mtmickush wrote:
| Other advanced tactics involve giving a broad clue that
| matches 3-4 of your own words and just one other (either your
| opponents' or a civilian). Your team can pick up all the
| matches across several turns, and the one off doesn't hurt as
| much as the plus four helps.
| hunter2_ wrote:
| The S-tier tactic: when a high-number clue is cut short
| by a turn-ending mistake, the guessers tell their clue
| giver how many cards from the truncated turn they can
| already locate without further information (it would be
| wasteful for a future clue to re-group those). The clue
| giver then inflates the number on the next, totally
| unrelated clue so that it covers its own cards plus those
| leftover cards.
|
| Example: The clue is "places 4" and the guessers choose 1
| correctly and then 1 wrong answer, but they had achieved
| consensus about 2 others (and are confused about only the
| remaining 1). So the turns ends but they inform the clue
| giver to inflate by 2 next turn. That clue giver (after the
| other team goes) will then say the clue is "people 5" and
| the guessers will know that they shall select 2 places and
| 3 people.
|
| This can cascade beyond just a pair of turns.
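|
| The bookkeeping from that example, as a sketch (Python; the
| numbers come straight from the scenario above):
|
|     # Turn 1: "places 4" -- guessers get 1 right, then 1 wrong
|     # (turn ends), but had consensus on 2 more of the places.
|     carryover = 2
|     # Turn 2: the new clue actually relates to 3 words, but the
|     # spymaster announces the inflated count: "people 5".
|     announced = 3 + carryover  # == 5
|     # Guessers then select the 2 known places plus 3 people.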
| ruds wrote:
| I don't think this sort of communication from guessers to
| clue giver is in the spirit of the game (at least in my
| play group). However, inflating later clues is a
| reasonable approach! It's just that I don't think you're
| allowed to communicate the amount of inflation. Guessers
| must determine whether people 5 has slack to allow
| additional guesses on previous clues.
| hunter2_ wrote:
| You're free to add additional prohibitions on
| communication as a house rule I guess, but the only
| prohibition in the rule book I've seen is that the clue
| giver's speech must consist exclusively of clues (and
| private consultation with the other clue giver). The clue
| giver is free to adjust their clue in reaction to
| anything they hear, and guessers can speak freely.
|
| Important: the clue giver cannot acknowledge the
| instruction during gameplay. That would certainly extend
| beyond giving a clue! The guessers must know that their
| clue giver can play this way prior to the game
| commencing.
|
| Edit: I just consulted the rules and this is the most
| relevant section:
|
| > If you are a field operative, you should focus on the
| table when you are making your guesses. Do not make eye
| contact with the spymaster while you are guessing. This
| will help you avoid nonverbal cues.
|
| > When your information is strictly limited to what can
| be conveyed with one word and one number, you are playing
| in the spirit of the game.
|
| The author's use of the pronoun "you/your" switches from
| field ops in that first paragraph to spymasters in that
| second paragraph, confusingly. With that in mind, it
| boils down to this: field ops cannot seek non-clue
| information from spymasters, and spymasters cannot convey
| non-clue information. The strategy I'm suggesting
| involves neither!
| ALittleLight wrote:
| If you take this idea of communication restrictions to
| the limit, you could imagine the guessers identifying N
| sets of cards by a single word each as they discuss their
| guess. The clue giver listens, then uses the clue that
| identifies the correct set of N cards.
|
| You really just need an algorithm to generate unique sets
| of 8 or 9 cards from the whole board and identify each set
| by a word.
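|
| A toy sketch of that degenerate endgame (Python; a hypothetical
| 5-card board -- on a real 25-card board, C(25,9) is about 2
| million sets, far more than any shared wordlist could name):
|
|     from itertools import combinations
|
|     # Guessers must find the 2 team cards on this toy board.
|     board = ["apple", "bridge", "laser", "ring", "shark"]
|     # Shared codebook: one agreed word per possible 2-card set.
|     names = ["alpha", "bravo", "charlie", "delta", "echo",
|              "foxtrot", "golf", "hotel", "india", "juliet"]
|     codebook = dict(zip(names, combinations(board, 2)))
|
|     def clue_for(team_cards):
|         # Clue giver: say the word naming the correct set.
|         for word, cards in codebook.items():
|             if set(cards) == set(team_cards):
|                 return word
|
|     print(clue_for({"bridge", "ring"}))  # -> "foxtrot"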
| groggo wrote:
| Yeah it's interesting to take these ideas to the
| extreme... even at the lower end I don't like it, I think
| zero communication outside of clues is the best way to
| follow the spirit of the game. But a little bit of banter
| and "kibitzing" is what makes it fun too.
| ta_1138 wrote:
| The communication is only necessary/important if people
| haven't set this as a convention in the first place. I'll
| say this prior to ever looking at my clues: "I will give
| you higher numbers than what I said if you miss by more
| than 1. The number I pick will always be high enough to
| allow you, with the +1 guess you get for free, to make
| guesses on all the words I was hinting at."
|
| There are also all kinds of not necessarily intended
| communication from the guessers, in the fact that you can
| listen to which words they were considering and didn't
| pick. Nothing in the game attempts to say that you should
| not consider, say, whether they were going in the right
| or wrong direction in their guessing, but it sure can
| make a difference in how to approach later clues. If they
| were being very wrong, there might be a need to double up
| on words that you intended and that your guessers
| missed.
|
| In the same fashion, nothing in the game says that I
| cannot listen to those guesses as a member of the other
| team, whether guesser or spymaster, and then change
| behaviors to make sure we don't hit words they considered
| as candidate words without very good reasons. Let them
| double dip on mistakes, or not make their difficult
| decisions easier. It's not as if the game demands that
| everyone who isn't currently guessing should wear
| headphones to be sure they disregard what the other team
| says or does.
| foota wrote:
| You can of course play however you want (and I certainly
| think this is clever), but imo this is likely against the
| spirit, and perhaps letter, of the rules.
|
| The rule on giving clues is:
|
| "If you are the spymaster, you are trying to think of a
| one-word clue that relates to some of the words your team
| is trying to guess. When you think you have a good clue,
| you say it. You also say one number, _which tells your
| teammates how many codenames are related to your clue_. "
| (emphasis mine).
|
| The rule states that the number should be the number of
| words related to the clue. There are later provisions
| _allowing_ you to use zero and infinity, but outside of
| these carve-outs (and imo the "allowed" language is
| telling here, since it implies any number not equal to
| the number of words is not allowed) I don't think this
| is legal.
| lostlogin wrote:
| > the one off doesn't hurt as much as the plus four helps
|
| Doesn't the turn end if you hit the opponents word?
| topaz0 wrote:
| Yes, but they can go back for those words in future
| rounds
| jncfhnb wrote:
| If you really want to get good, your goal is not so much to
| get as many tiles as possible, but rather to get the tiles
| that are semantically distinct from your opponent's. A single
| mistake that triggers your opponent's tile is generally
| enough to lose the game. And even if that doesn't happen,
| having them uncover the tiles from their side that are
| semantically similar to your own team's is also useful.
|
| If you want to get nasty, you learn to abuse the fact that
| the tile layouts follow rules and that you can rule out
| certain tiles without considering the words.
| groggo wrote:
| Memorizing the tile layouts is too much for me haha (imo
| against the spirit of the game). I usually play online now
| anyway so I hope they don't follow those same patterns as
| the physical version.
| bentcorner wrote:
| > you learn to abuse the fact that the tile layouts follow
| rules and that you can rule out certain tiles without
| considering the words.
|
| Can you clarify? Isn't the card placement random?
| jncfhnb wrote:
| It's randomish. There are facts about the possible
| layouts you can memorize.
|
| I don't know most of these rules but there's never 5 in a
| row; or even 4 in a row if you're the team with one fewer
| (second team to play).
|
| Edit: because the game layout is determined by choosing
| one of a few dozen possible layout cards and randomly
| rotating it
| tveita wrote:
| There are 40 setup cards with 4 possible rotations that
| specify agent placements, so it's theoretically possible
| to do some kind of memorization.
|
| Personally I'd find that kind of play style very unfun,
| and would rather switch to fully randomized boards if I
| played enough that it became a problem.
|
| https://danluu.com/codenames/
| harrall wrote:
| I find the game more about reading the people on your team
| (and the other team) to understand how they think.
|
| You have to give entirely different clues depending on the
| people you play with.
|
| Sometimes you can also play adversarial and introduce doubt
| into the opposing team by giving topic-adjacent clues that
| cause them to avoid one of their own cards. It works better
| if someone on the other team tends to be a big doubter. It
| also can work when the other team constantly goes back and
| tries to pick n+1 cards that they think they missed from the
| last round, which gives you a lot of room to psychologically
| mess with them.
|
| Sometimes you have a clue that only really matches 2, but
| because only 1 of the wrong matches is a neutral card and you
| could match 2 more by a massive stretch, you say "4." Worst
| case, they get 2 right and then pick the neutral card; in the
| best case, you stand to gain 4 from a clue that should only
| match 2.
|
| I like Codenames because there are many meta ways to play the
| game. What makes Codenames unique is that, unlike a lot of
| other games (Catan, Secret Hitler, CAH, etc.), it's an
| adversarial team game where the team dynamics and discussions
| are not secret, so you can use them to your advantage.
| blix wrote:
| Experienced players who know their teammates well can
| reliably get 3-4s. If you only go for safe 2s against these
| opponents you will lose every time.
| dang wrote:
| Ok, we've taken supergoodness out of the title now. Presumably
| the post is still interesting!
|
| (Submitted title was "I got OpenAI o1 to play the boardgame
| Codenames and it's super good".)
| jsemrau wrote:
| I have been doing some experiments with agents and reinforcement
| learning playing a 4x4 Tic-Tac-Toe game [1]. Given my analysis
| of the "thought" process, we are still really far from true
| understanding of such games. While in my game, as well as the
| OP's, the rules are pre-trained and the models are good enough
| to reach a conclusion (which in itself is already impressive),
| there is still a long way to go.
|
| [1] https://jdsemrau.substack.com/p/nemotron-vs-qwen-game-theory...
| lolinder wrote:
| A small weakness in this test is that one of the keys to
| strategic Codenames play is understanding your partner. You're
| not just trying to connect the words, you're trying to connect
| them in a way that will be obvious to your partner. As a
| computing analogy: you're trying to serialize a few cards in a
| way that will be deserializable by the other player.
|
| This test pairs o1 with itself, which means the serializer _is_
| the deserializer. So while it's impressive that it can link 4
| words, most humans could also easily link 4 with as much
| stretching! We just don't tend to, because we can't guarantee
| that the other human will make the same connections we did.
| ModernMech wrote:
| lol, I played this game with my family and they said my wife and
| I were cheating because I kept using inside jokes that made no
| sense to them but that she would get immediately.
| dgritsko wrote:
| That's a big part of what makes this game enjoyable - a clue
| that is very obvious to one person might not even cross the
| mind of someone else. To anyone reading this who hasn't
| played, it's definitely worth giving it a try.
| slyn wrote:
| Agreed, big fan of codenames in general but it plays its
| best when you're playing against / alongside people that
| you've known for a while. The metagaming aspect of
| structuring clues to who your partner is really takes it to
| the next level.
| jncfhnb wrote:
| Ehhh I don't think that's accurate. The problem is not linking
| 4 words. It's linking 4 words without accidentally triggering
| other, semantically adjacent words.
|
| This task could probably be solved nearly as well with
| old-school word2vec embeddings.
| lolinder wrote:
| Right, that's what I meant to be getting at: when you connect
| 4 words with as much stretching as o1 did there, you're
| running a real risk that the other party connects a different
| set. Unless that other party is also you and has the same
| learned connections at top of mind.
| furyofantares wrote:
| > This task could probably be solved nearly just as well with
| old school word 2 vec embeddings
|
| I've tried. This approach is well beyond awful.
| jerkstate wrote:
| You can pretty reliably get 2-clues and sometimes good 3-clues
| just using word2vec embedding similarity
| simonw wrote:
| I've been trying out various "reasoning" models (o1, R1, Gemini
| Thinking etc) against the NYT Connections word puzzle - it's a
| really interesting test of them. So far o1 Pro has been the most
| consistently successful:
| https://www.nytimes.com/games/connections
| topaz0 wrote:
| Wonder if they use LLMs to write those puzzles.
| macromaniac wrote:
| I made one a few years back where you play with the AI instead
| of AI vs. AI, but never posted it anywhere. If anyone wants to
| try, I just updated it to gpt-4o-mini:
| https://wordswithrobots.isotropic.us/
| blakeburch wrote:
| Love the idea! Just wish you could specify a number like you do
| in Codenames. Otherwise, it just keeps going until all of its
| options are wrong.
| macromaniac wrote:
| True, because then it feels more intentional (+ the extra
| strategy). It was definitely a bit thrown together -- atm I
| only ever use it when I need a bit of practice before playing
| Codenames.
| Amekedl wrote:
| "o1 is more knowledgeable than the average human"
|
| "the toyota yaris can move faster than the average human"
|
| even opt-125m from years ago can pull more facts than the average
| human.
| bongodongobob wrote:
| I played it with 3.5 and it was great. This isn't something o1
| just picked up on.
| lsy wrote:
| Some of these clues wouldn't be very good for a human player.
| "007", for example, isn't a very good clue for "laser", not only
| because something _happening to be_ in one of several films
| about a character doesn't rise to the typical level of salience,
| but also because other words on the board, like "shark" and
| "astronaut", even more so meet the criterion of featuring
| prominently in James Bond movies -- and "astronaut" appears to
| be a game-ending choice.
| yantrams wrote:
| I cracked myself up with a ridiculous train of thought for fun
| while playing Codenames once. It went a little something like
| this
|
| Star => Twinkle => Twinkle Khanna => Married to Akshay Kumar =>
| Canadian Citizen => Maple Syrup ( Leaf ? )
| lynguist wrote:
| I kinda have the same very subjective feeling where o1 is the
| first AI that is clearly superior to me.
| some_random wrote:
| I don't find this remotely compelling. I can easily come up with
| clues that make sense to me to connect a ton of words; the
| difficulty is coming up with clues that others will look at the
| same way. The last example is exactly what I mean: "paper" makes
| sense for those 4 only when you explain it. If "Line" counts,
| then why not "Gum" (which is typically wrapped in paper)? And if
| "Lawyer" is valid, then why not "King" (whose decrees are
| written on what)?
| jinyang0220 wrote:
| Dude I looooooooooved that game. How long did u spend building
| it?
___________________________________________________________________
(page generated 2025-01-25 23:00 UTC)