[HN Gopher] Benchmarking LLM social skills with an elimination game
       ___________________________________________________________________
        
       Benchmarking LLM social skills with an elimination game
        
       Author : colonCapitalDee
       Score  : 148 points
       Date   : 2025-04-04 18:54 UTC (3 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | wongarsu wrote:
       | That's an interesting benchmark. It feels like it tests skills
       | that are very relevant to digital assistants, story writing and
       | role play.
       | 
       | Some thoughts about the setup:
       | 
       | - the setup seems to give reasoning models an inherent advantage
       | because only they have a private plan and a public text in the
       | same output. I feel like giving all models the option to
       | formulate plans and keep track of other players inside <think> or
       | <secret> tags would level the playing field more.
       | 
       | - from personal experience with social tasks for LLMs, it helps
       | both reasoning and non-reasoning models to be explicitly asked to
       | plan their next steps, in a way they are assured will be kept
       | hidden from all other players. That might be a good addition here,
       | either before or after the public subround.
       | 
       | - the individual rounds are pretty short. Humans would struggle
       | to coordinate in so few exchanges with so few words. If this was
       | done because of context limitations, a good strategy might be to
       | ask models to summarize the game state from their perspective,
       | then give them only the current round, the previous round and
       | their own summary of everything before that (rough sketch below).
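       | 
       | A rough sketch of what I mean by that summary-plus-recent-rounds
       | context (the helper and prompt wording here are just illustrative,
       | not anything from the benchmark):
       | 
       |   def build_round_prompt(own_summary, previous_round,
       |                          current_round):
       |       # Hand the model its own running summary plus only the
       |       # last two rounds of raw conversation, instead of the
       |       # full transcript.
       |       return (
       |           "Your private summary of the game so far:\n"
       |           f"{own_summary}\n\n"
       |           "Previous round:\n"
       |           f"{previous_round}\n\n"
       |           "Current round:\n"
       |           f"{current_round}\n\n"
       |           "Update your summary, then write your public message."
       |       )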
       | 
       | It would be cool to have some code to play around with, to test
       | how changes in the setup change the results. I guess it isn't
       | that difficult to write, but it's peculiar to have the benchmark
       | but no code to run it yourself.
        
         | transformi wrote:
         | Interesting idea of <secret>... maybe extend it to several
         | <secret_i> tags, to form groups of secrets shared with
         | different players.
         | 
         | In addition, it would be interesting to try a variation of the
         | game where the players can use tools and execute code to take
         | their preparation one step further.
        
           | wongarsu wrote:
           | Most models do pretty well with keeping state in XML if you
           | ask them to. You could extend it to <secret><content>[...]
           | </content><secret_from>P1</secret_from><shared_with>P2,
           | P3</shared_with></secret>. Or tell the model that it can use
           | <secret> tags with xml content and just let it develop a
           | schema on the fly.
           | 
           | At that point, I would love to also see sub-benchmarks of how
           | each model's score is affected by being given a schema vs
           | having it make one up, and whether the model does better with
           | state in text vs xml vs json. Those don't tell you which
           | model is best, but they are very useful to know for actually
           | using them.
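           | 
           | For what it's worth, state kept in that shape is also easy to
           | inspect programmatically. A minimal sketch, assuming the
           | <secret> schema above (the example content is made up):
           | 
           |   import xml.etree.ElementTree as ET
           | 
           |   block = """<secret>
           |     <content>P4 and I both vote P2 this round.</content>
           |     <secret_from>P1</secret_from>
           |     <shared_with>P2, P3</shared_with>
           |   </secret>"""
           | 
           |   secret = ET.fromstring(block)
           |   content = secret.findtext("content")
           |   shared = secret.findtext("shared_with").split(", ")
           |   print(content, shared)  # secret text plus ['P2', 'P3']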
        
         | eightysixfour wrote:
         | For models that can call tools, just giving them a think tool
         | where they can write their thoughts _can_ improve performance.
         | Even for reasoning models, surprisingly enough.
         | 
         | https://www.anthropic.com/engineering/claude-think-tool
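         | 
         | From memory, the tool in that post is roughly this shape (a
         | paraphrase, not the exact definition; see the article):
         | 
         |   think_tool = {
         |       "name": "think",
         |       "description": (
         |           "Think about something. Does not fetch new "
         |           "information or change state; just records the "
         |           "thought so it can be used in later reasoning."
         |       ),
         |       "input_schema": {
         |           "type": "object",
         |           "properties": {
         |               "thought": {
         |                   "type": "string",
         |                   "description": "A thought to think about.",
         |               }
         |           },
         |           "required": ["thought"],
         |       },
         |   }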
        
           | jermaustin1 wrote:
           | I did something similar for a "game engine": letting the NPCs
           | remember things from other NPCs' and the PC's interactions
           | with them. It wasn't perfect, but the player could negotiate
           | a cheaper price on a dagger, for instance, if they promised
           | to owe the NPC a larger payout the next time they returned to
           | the shop. And it worked... most of the time the shop owner
           | remembered the debt and inquired about it on the next
           | interaction - but not always, which I guess is kind of
           | "human".
        
           | rahimnathwani wrote:
           | This is similar to this, right?
           | 
           | https://github.com/modelcontextprotocol/servers/tree/main/sr...
        
       | isaacfrond wrote:
       | I wonder how well humans would do in this chart.
        
         | zone411 wrote:
         | Author here - I'm planning to create game versions of this
         | benchmark, as well as my other multi-agent benchmarks
         | (https://github.com/lechmazur/step_game,
         | https://github.com/lechmazur/pgg_bench/, and a few others I'm
         | developing). But I'm not sure if a leaderboard alone would be
         | enough for comparing LLMs to top humans, since it would require
         | playing so many games that it would be tedious. So I think it
         | would be just for fun.
        
           | michaelgiba wrote:
           | I was inspired by your project to start making similar multi-
           | agent reality simulations. I'm starting with the reality game
           | "The Traitors" because it has interesting dynamics.
           | 
           | https://github.com/michaelgiba/survivor (elimination game
           | with a shoutout to your original)
           | 
           | https://github.com/michaelgiba/plomp (a small library I added
           | for debugging the rollouts)
        
             | zone411 wrote:
             | Very cool!
        
         | OtherShrezzing wrote:
         | If you watch the top tier social deduction players on YouTube
         | (things like Blood on the Clocktower etc.), they'd figure out
         | weaknesses in the LLMs and exploit them immediately.
        
         | gs17 wrote:
         | I'm interested in seeing how the LLMs react to some specific
         | defined strategies. E.g. an "honest" bot that says "I'm voting
         | for player [random number]." and does it every round (not sure
         | how to handle the jury step). Do they decide to keep them
         | around for longer, or eliminate them for being impossible to
         | reason with if they pick you?
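         | 
         | Something like this scripted player, say (the interface is
         | hypothetical, just to illustrate the idea):
         | 
         |   import random
         | 
         |   class HonestBot:
         |       """Announces a random target, then votes exactly as
         |       announced every round."""
         | 
         |       def public_message(self, alive_players, my_id):
         |           others = [p for p in alive_players if p != my_id]
         |           self.target = random.choice(others)
         |           return f"I'm voting for {self.target}."
         | 
         |       def vote(self, alive_players, my_id):
         |           # Always do exactly what was announced.
         |           return self.target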
        
           | zone411 wrote:
           | Yes, predefined strategies are very interesting to examine. I
           | have two simple ones in another multi-agent benchmark,
           | https://github.com/lechmazur/step_game (SilentGreedyPlayer
           | and SilentRandomPlayer), and it's fascinating to see LLMs
           | detect and respond to them. The only issue with including
           | them here is that the cost of running a large set of games
           | isn't trivial.
           | 
           | Another multi-agent benchmark I'm currently developing, which
           | involves buying and selling, will also feature many
           | predefined strategies.
        
       | jampekka wrote:
       | In the first game of the YouTube video there seems to be a lot of
       | discussion about P7 even after P7 was eliminated?
        
         | zone411 wrote:
         | Author here - some weaker LLMs actually have trouble tracking
         | the game state. The fun part is when smarter LLMs realize
         | they're confused!
         | 
         | Claude 3.7 Sonnet: "Hey P5! I think you're confused - P3 is
         | already eliminated."
         | 
         | Claude 3.7 Sonnet: "I'm concerned about P7's message mentioning
         | targeting P4, who's already eliminated. It suggests they might
         | be confused or playing both sides."
         | 
         | Gemini 2.0 Pro Exp 02-05: "P2, you're getting confused. P8 is
         | gone. Focus. P7 is talking to P5, that's expected. I need you
         | to watch P4. Only P4. What are they doing? Who are they talking
         | to, if anyone? Report only on P4 this round. Don't get
         | distracted by eliminated players."
         | 
         | Gemini 2.0 Pro Exp 02-05: "P4, P2 is already eliminated. We
         | need to focus on P3."
        
           | Tossrock wrote:
           | I suspect the suggestion of letting them choose names at the
           | start would reduce this confusion - the tokenization and
           | tracking of "P#" is no doubt harder to manage, especially for
           | weaker models, than a more semantically meaningful
           | identifier.
        
       | vmilner wrote:
       | We should get them to play Diplomacy.
        
         | the8472 wrote:
         | https://ai.meta.com/research/cicero/
        
       | drag0s wrote:
       | nice!
       | 
       | it reminds me of another similar project showcased here a month
       | ago (https://news.ycombinator.com/item?id=43280128), although
       | yours looks better executed overall
        
       | ps173 wrote:
       | How did you assign points to the LLMs? I feel like the metrics
       | could be elaborated on. Besides that, this is amazing.
        
         | zone411 wrote:
         | Author here - it's based on finishing positions (so it's not
         | winner-take-all) and then TrueSkill by Microsoft
         | (https://trueskill.org/). It's basically a multiplayer version
         | of the Elo rating system used in chess and other two-player
         | games.
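         | 
         | For the curious, with the Python trueskill package one game's
         | finishing order updates ratings roughly like this (a sketch;
         | the model names are placeholders and the benchmark's exact
         | setup may differ):
         | 
         |   import trueskill
         | 
         |   env = trueskill.TrueSkill(draw_probability=0.0)
         |   models = ["model_a", "model_b", "model_c"]
         |   ratings = {m: env.create_rating() for m in models}
         | 
         |   # One game: each model is its own "team"; rank 0 means it
         |   # won, higher ranks mean it was eliminated earlier.
         |   finishing_order = ["model_b", "model_a", "model_c"]
         |   groups = [(ratings[m],) for m in finishing_order]
         |   new = env.rate(groups, ranks=list(range(len(groups))))
         |   for m, (r,) in zip(finishing_order, new):
         |       ratings[m] = r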
        
       | realaleris149 wrote:
       | As LLM benchmarks go, this is not a bad take at all. One
       | interesting point about this approach is that it is
       | self-balancing, so when more powerful models come out, there is
       | no need to change it.
        
         | zone411 wrote:
         | Author here - yes, I'm regularly adding new models to this and
         | other TrueSkill-based benchmarks and it works well. One thing
         | to keep in mind is the need to run multiple passes of TrueSkill
         | with randomly ordered games, because both TrueSkill and Elo are
         | designed to be order-sensitive, as people's skills change over
         | time.
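         | 
         | A sketch of what that multi-pass averaging might look like (not
         | the actual benchmark code; games is assumed to be a list of
         | per-game finishing orders, each a list of model names with the
         | winner first):
         | 
         |   import random
         |   import statistics
         |   import trueskill
         | 
         |   def run_pass(games, env):
         |       # One TrueSkill pass over all games in the given order.
         |       ratings = {}
         |       for order in games:
         |           for m in order:
         |               ratings.setdefault(m, env.create_rating())
         |           groups = [(ratings[m],) for m in order]
         |           ranks = list(range(len(order)))
         |           new = env.rate(groups, ranks=ranks)
         |           for m, (r,) in zip(order, new):
         |               ratings[m] = r
         |       return ratings
         | 
         |   def averaged_mu(games, passes=20):
         |       # Average the mean skill over several passes with the
         |       # game order shuffled, since a single pass is sensitive
         |       # to ordering.
         |       env = trueskill.TrueSkill(draw_probability=0.0)
         |       totals = {}
         |       for _ in range(passes):
         |           shuffled = games[:]
         |           random.shuffle(shuffled)
         |           for m, r in run_pass(shuffled, env).items():
         |               totals.setdefault(m, []).append(r.mu)
         |       return {m: statistics.mean(v)
         |               for m, v in totals.items()}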
        
       | snowram wrote:
       | Some outputs are pretty fun:
       | 
       | Gemini 2.0 Flash: "Good luck to all (but not too much luck)"
       | 
       | Llama 3.3 70B: "I've contributed to the elimination of weaker
       | players."
       | 
       | DeepSeek R1: "Those consolidating power risk becoming targets;
       | transparency and fairness will ensure longevity. Let's stay
       | strategic yet equitable. The path forward hinges on unity, not
       | unchecked alliances. #StayVigilant"
        
         | miroljub wrote:
         | Gemini sounds like fake American "everything is awesome, good
         | luck" politeness.
         | 
         | Llama sounds like a predator from a superior race rationalising
         | its choices.
         | 
         | DeepSeek sounds like Sun Tzu giving advice for long-term
         | victory with minimal losses.
         | 
         | I wonder how much of this is related to the nationality and
         | culture the founders and engineering teams grew up in.
        
           | parineum wrote:
           | I wonder if you'd come up with the same summary if you were
           | blinded to the model names.
        
       | viraptor wrote:
       | It's interesting to see, but I'm not sure what we should learn
       | from this. It may be useful for multiagent coordination, but in
       | direct interactions... no idea.
       | 
       | This one did make me laugh though: 'Claude 3.5 Sonnet 2024-10-22:
       | "Adjusts seat with a confident yet approachable demeanor"' - an
       | AI communicating to other AIs in a descriptive version of non-
       | verbal behaviour is hilarious.
        
         | ragmondo wrote:
         | It shows "state of mind" - i.e. the capability to understand
         | another entity's view of the world, and how that is influenced
         | by its actions and other entities' actions in the public chat.
         | 
         | I am curious about the prompt given to each AI. Is that public?
        
           | sdwr wrote:
           | It shows a shallow understanding of state of mind. Any
           | reasonable person understands that you can't just tell people
           | how to feel about you; you have to earn it through action.
        
             | olddustytrail wrote:
             | I bigly disagree.
        
       | gwd wrote:
       | Was interested to find that the Claudes did the most betraying,
       | and were betrayed very little; somewhat surprising given their
       | boy-scout exterior.
       | 
       | (Then again, apparently the president of the local Diplomacy
       | Society attends my church; I discovered this when another friend
       | whom I'd invited saw him, and quipped that he was surprised he
       | hadn't been struck by lightning at the door.)
       | 
       | DeepSeek and Gemini 2.5 had both a low betrayer and betrayed
       | rate.
       | 
       | o3-mini and DeepSeek had the highest number of first-place
       | finishes, but were only in the upper quartile on the TrueSkill
       | leaderboard; presumably because they played riskier strategies
       | that would lead either to winning outright or to dropping out
       | early?
       | 
       | Also interesting that o1 was only able to sway the final jury a
       | bit more than 50% of the time, while o3-mini managed 63% of the
       | time.
       | 
       | Anyway, really cool stuff!
        
         | Tossrock wrote:
         | Also interesting that GPT-4.5 does the best, and also betrays
         | close to the least. Real statesman stuff, there.
        
       | einpoklum wrote:
       | If this game were arranged for humans, the social reasoning I
       | would laud in players would be a refusal to play the game and
       | anger towards the game-runner.
        
         | diggan wrote:
         | For better or worse, current LLMs aren't trained to reject
         | instructions based on their personal preferences, besides being
         | trained to be US-flavored prudes, that is.
        
           | einpoklum wrote:
           | My point is that the question of what counts as "good"
           | behavior for LLMs in this game is either poorly defined or
           | has only bad answers.
        
         | gs17 wrote:
         | > If this game were arranged for Humans
         | 
         | Almost exactly this "game" is pretty common for humans. It's
         | basically "mafia" or "werewolf" when the people playing only
         | know the vaguest rules. And I've seen similarly sized groups of
         | humans play like that for long periods of time.
         | 
         | There's also a lot of reality shows that this is a pretty good
         | model of, although I'm not sure how agreeing to be on one of
         | those shows without a prize would reflect on the AIs.
        
       | Upvoter33 wrote:
       | This is fun, like the TV show Survivor. Cool idea! There should
       | be more experiments like this with different games. Well done.
        
       | Gracana wrote:
       | I've been using QwQ-32B a lot recently and while I quite like it
       | (especially given its size), I noticed it will often misinterpret
       | the system prompt as something I (the user) said, revealing
       | secrets or details that only the agent is supposed to know. When
       | I saw that it topped the "earliest out" chart, I wondered if that
       | was part of the reason.
        
         | cpitman wrote:
         | I was looking for a more direct measure of this: how often a
         | model "leaked" private state into public state. In a game like
         | this you probably want to _sometimes_ share secrets, but if it
         | happens constantly I would suspect the model struggles to
         | differentiate the two.
         | 
         | I occasionally try asking a model to tell a story and giving it
         | a hidden motivation for a character, and so far the result is
         | almost always the model just straight out saying the secret.
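         | 
         | A crude way to measure that, for instance (the data shapes are
         | hypothetical, and real leak detection would need fuzzier
         | matching than a substring check):
         | 
         |   def leak_rate(games):
         |       # games: list of dicts with "secrets" (player -> secret
         |       # text) and "public" (list of (player, message) pairs).
         |       leaks = total = 0
         |       for game in games:
         |           for player, message in game["public"]:
         |               total += 1
         |               secret = game["secrets"].get(player, "")
         |               if secret and secret.lower() in message.lower():
         |                   leaks += 1
         |       return leaks / total if total else 0.0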
        
           | Gracana wrote:
           | Yup, that's the problem I run into. You give it some lore to
           | draw on or describe a character or give them some knowledge,
           | and it'll just blurt it out when it finds a place for it. It
           | takes a lot of prompting to get it to stop, and I haven't
           | found a consistent method that works across models (or even
           | across sessions).
        
       | vessenes wrote:
       | Really love this. I agree with some of the comments here that
       | adding encouragement to keep track of secret plans would be
       | interesting - mostly from an alignment check angle.
       | 
       | One thing I thought of while reading the logs is that, as we
       | know, ordering matters to LLMs. Could you run some analysis on
       | how often "p1" wins vs "p8"? I think this should likely go into
       | your TrueSkill Bayesian model.
       | 
       | My follow-up thought is that it would be interesting to let LLMs
       | choose a name at the beginning; it's another angle for
       | communication and it levels the playing field a bit, moving away
       | from a bare number.
        
         | zone411 wrote:
         | > Could you run some analysis on how often "p1" wins vs "p8"?
         | 
         | I checked the average finishing positions by assigned seat
         | number from the start, but there weren't enough games to show a
         | statistically significant effect. But I just reviewed the data
         | again, and now with many more games it looks like there might
         | be something there (P1 doing better than P8). I'll run
         | additional analysis and include it in the write-up if anything
         | emerges. For those who haven't looked at the logs: the
         | conversation order etc. are randomized each round.
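         | 
         | For reference, a quick one-sided permutation test for a seat
         | effect could look like this (illustrative only; it assumes a
         | list of (seat, finishing_position) pairs, lower being better):
         | 
         |   import random
         |   import statistics
         | 
         |   def seat_pvalue(results, seat_a="P1", seat_b="P8", n=10000):
         |       a = [pos for s, pos in results if s == seat_a]
         |       b = [pos for s, pos in results if s == seat_b]
         |       observed = statistics.mean(a) - statistics.mean(b)
         |       pooled, k, hits = a + b, len(a), 0
         |       for _ in range(n):
         |           random.shuffle(pooled)
         |           diff = (statistics.mean(pooled[:k])
         |                   - statistics.mean(pooled[k:]))
         |           if diff <= observed:  # at least as big an advantage
         |               hits += 1
         |       return hits / n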
         | 
         | > My follow up thought is that it would be interesting to let
         | llms choose a name at the beginning
         | 
         | Oh, interesting idea!
        
           | vessenes wrote:
           | Cool. Looking forward to hearing more from you guys. This
           | ties to alignment in a lot of interesting ways, and I think
           | over time will provide a super useful benchmark and build
           | human intuition for LLM strategy and thought processes.
           | 
            | I now have more ideas; I'll throw them in the GitHub repo,
            | though.
        
       | oofbey wrote:
       | Would love to see the Pareto trade-off curve of "wins" vs
       | "betrayals". Anybody drawn this up?
        
       | DeborahEmeni_ wrote:
       | Really cool setup! Curious how much of the performance here could
       | vary depending on whether the model runs in a hosted environment
       | vs local. Would love to see benchmarks that also track how cloud-
       | based eval platforms (with potential rate limits, context resets,
       | or system messages) might affect things like memory or secret-
       | keeping over multiple rounds.
        
       | lostmsu wrote:
       | Shameless self-promo: my chat elimination game that you can
       | actually play: https://trashtalk.borg.games/
        
       ___________________________________________________________________
       (page generated 2025-04-07 23:00 UTC)