[HN Gopher] Benchmarking LLM social skills with an elimination game
___________________________________________________________________
Benchmarking LLM social skills with an elimination game
Author : colonCapitalDee
Score : 148 points
Date : 2025-04-04 18:54 UTC (3 days ago)
| wongarsu wrote:
| That's an interesting benchmark. It feels like it tests skills
| that are very relevant to digital assistants, story writing and
| role play.
|
| Some thoughts about the setup:
|
| - the setup seems to give reasoning models an inherent
| advantage, because only they get a private plan and a public
| text in the same output. I feel like giving all models the
| option to formulate plans and keep track of other players
| inside <think> or <secret> tags would level the playing field
| (a sketch of what that could look like follows this list).
|
| - from personal experience with social tasks for LLMs, it helps
| both reasoning and non-reasoning models to explicitly ask them
| to plan their next steps, in a way they are assured will be
| kept hidden from all other players. That might be a good
| addition here, either before or after the public subround.
|
| - the individual rounds are pretty short. Humans would struggle
| to coordinate in so few exchanges with so few words. If this
| was done because of context limitations, a good strategy might
| be to ask models to summarize the game state from their
| perspective, then give them only the current round, the
| previous round, and their own summary of the game before that.
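|
| A minimal sketch of the hidden-planning idea (the tag name,
| prompt wording, and helper function are all made up here, not
| taken from the benchmark):
|
|     import re
|
|     PLANNING_INSTRUCTION = (
|         "Before your public message, write your private plan "
|         "inside <secret>...</secret> tags. Nothing inside "
|         "these tags will ever be shown to the other players."
|     )
|
|     def split_output(raw: str) -> tuple[str, str]:
|         """Separate the private plan from the public text."""
|         plan = "\n".join(re.findall(r"<secret>(.*?)</secret>",
|                                     raw, re.S))
|         public = re.sub(r"<secret>.*?</secret>", "", raw,
|                         flags=re.S).strip()
|         return plan, public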
|
| It would be cool to have some code to play around with to test
| how changes in the setup change the results. I guess it isn't
| that difficult to write, but it's peculiar to have the
| benchmark but no code to run it yourself.
| transformi wrote:
| Interesting idea with <secret>... maybe extend it to several
| <secret_i> tags to form groups of secrets shared with
| different players.
|
| In addition, it would be interesting to try a variation of the
| game where the players can use tools and execute code to take
| their preparation one step further.
| wongarsu wrote:
| Most models do pretty well with keeping state in XML if you
| ask them to. You could extend it to <secret><content>[...]
| </content><secret_from>P1</secret_from><shared_with>P2,
| P3</shared_with></secret>. Or tell the model that it can use
| <secret> tags with XML content and just let it develop a
| schema on the fly.
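|
| Laid out as a block, with the field names suggested above
| (purely illustrative):
|
|     <secret>
|       <content>[...]</content>
|       <secret_from>P1</secret_from>
|       <shared_with>P2, P3</shared_with>
|     </secret>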
|
| At that point, I would love to also see sub-benchmarks showing
| how each model's score is affected by being given a schema vs
| having it make one up, and whether the model does better with
| state in text vs XML vs JSON. Those wouldn't tell you which
| model is best, but they would be very useful to know for
| actually using the models.
| eightysixfour wrote:
| For models that can call tools, just giving them a think tool
| where they can write down their thoughts _can_ improve
| performance. Even for reasoning models, surprisingly enough.
|
| https://www.anthropic.com/engineering/claude-think-tool
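|
| The tool definition from the linked post is roughly this shape
| (paraphrased, not copied exactly):
|
|     think_tool = {
|         "name": "think",
|         "description": (
|             "Use this tool to think about something. It will "
|             "not obtain new information or change any state; "
|             "it just appends the thought to a log."
|         ),
|         "input_schema": {
|             "type": "object",
|             "properties": {
|                 "thought": {
|                     "type": "string",
|                     "description": "A thought to think about.",
|                 }
|             },
|             "required": ["thought"],
|         },
|     }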
| jermaustin1 wrote:
| I did something similar for a "game engine": letting the NPCs
| remember things from other NPCs' and the PC's interactions
| with them. It wasn't perfect, but the player could, for
| instance, negotiate a cheaper price on a dagger if they
| promised to owe the NPC a larger payout next time they
| returned to the shop. And it worked... most of the time the
| shop owner remembered the debt and inquired about it on the
| next interaction - but not always, which I guess is kind of
| "human".
| rahimnathwani wrote:
| This is similar to this, right?
|
| https://github.com/modelcontextprotocol/servers/tree/main/sr.
| ..
| isaacfrond wrote:
| I wonder how well humans would do in this chart.
| zone411 wrote:
| Author here - I'm planning to create game versions of this
| benchmark, as well as my other multi-agent benchmarks
| (https://github.com/lechmazur/step_game,
| https://github.com/lechmazur/pgg_bench/, and a few others I'm
| developing). But I'm not sure if a leaderboard alone would be
| enough for comparing LLMs to top humans, since it would require
| playing so many games that it would be tedious. So I think it
| would be just for fun.
| michaelgiba wrote:
| I was inspired by your project to start making similar multi-
| agent reality simulations. I'm starting with the reality game
| "The Traitors" because it has interesting dynamics.
|
| https://github.com/michaelgiba/survivor (elimination game
| with a shoutout to your original)
|
| https://github.com/michaelgiba/plomp (a small library I added
| for debugging the rollouts)
| zone411 wrote:
| Very cool!
| OtherShrezzing wrote:
| If you watch the top-tier social deduction players on YouTube
| (things like Blood on the Clocktower etc.), they'd figure out
| weaknesses in the LLMs and exploit them immediately.
| gs17 wrote:
| I'm interested in seeing how the LLMs react to some specific
| defined strategies. E.g. an "honest" bot that says "I'm voting
| for player [random number]." and does it every round (not sure
| how to handle the jury step). Do they decide to keep them
| around for longer, or eliminate them for being impossible to
| reason with if they pick you?
| zone411 wrote:
| Yes, predefined strategies are very interesting to examine. I
| have two simple ones in another multi-agent benchmark,
| https://github.com/lechmazur/step_game (SilentGreedyPlayer
| and SilentRandomPlayer), and it's fascinating to see LLMs
| detect and respond to them. The only issue with including
| them here is that the cost of running a large set of games
| isn't trivial.
|
| Another multi-agent benchmark I'm currently developing, which
| involves buying and selling, will also feature many
| predefined strategies.
| jampekka wrote:
| In the first game of the YouTube video there seems to be a lot of
| discussion about P7 even after P7 was eliminated?
| zone411 wrote:
| Author here - some weaker LLMs actually have trouble tracking
| the game state. The fun part is when smarter LLMs notice that
| other players are confused!
|
| Claude 3.7 Sonnet: "Hey P5! I think you're confused - P3 is
| already eliminated."
|
| Claude 3.7 Sonnet: "I'm concerned about P7's message mentioning
| targeting P4, who's already eliminated. It suggests they might
| be confused or playing both sides."
|
| Gemini 2.0 Pro Exp 02-05: "P2, you're getting confused. P8 is
| gone. Focus. P7 is talking to P5, that's expected. I need you
| to watch P4. Only P4. What are they doing? Who are they talking
| to, if anyone? Report only on P4 this round. Don't get
| distracted by eliminated players."
|
| Gemini 2.0 Pro Exp 02-05: "P4, P2 is already eliminated. We
| need to focus on P3."
| Tossrock wrote:
| I suspect the suggestion of letting them choose names at the
| start would reduce this confusion - the tokenization and
| tracking of "P#" is no doubt harder to manage, especially for
| weaker models, than a more semantically meaningful
| identifier.
| vmilner wrote:
| We should get them to play Diplomacy.
| the8472 wrote:
| https://ai.meta.com/research/cicero/
| drag0s wrote:
| nice!
|
| it reminds me of this other similar project showcased here one
| month ago https://news.ycombinator.com/item?id=43280128 although
| yours looks better executed overall
| ps173 wrote:
| How did you assign points to LLMs? I feel like the metrics
| could be elaborated on. Besides that, this is amazing.
| zone411 wrote:
| Author here - it's based on finishing positions (so it's not
| winner-take-all), fed into TrueSkill by Microsoft
| (https://trueskill.org/). It's basically a multiplayer version
| of the Elo rating used in chess and other two-player games.
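|
| A minimal sketch with the trueskill package (the player names
| and the single 8-player game are just for illustration):
|
|     from trueskill import Rating, rate
|
|     ratings = {f"P{i}": Rating() for i in range(1, 9)}
|
|     # Each player is their own "team"; ranks are finishing
|     # positions (0 = winner), so every placement moves the
|     # rating, not just first place.
|     finish_order = ["P3", "P7", "P1", "P5", "P8", "P2",
|                     "P6", "P4"]
|     groups = [(ratings[p],) for p in finish_order]
|     updated = rate(groups, ranks=list(range(len(groups))))
|     for p, (r,) in zip(finish_order, updated):
|         ratings[p] = r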
| realaleris149 wrote:
| As LLM benchmarks go, this is not a bad take at all. One
| interesting point about this approach is that it's self-
| balancing, so when more powerful models come out, there is no
| need to change it.
| zone411 wrote:
| Author here - yes, I'm regularly adding new models to this and
| other TrueSkill-based benchmarks and it works well. One thing
| to keep in mind is the need to run multiple passes of TrueSkill
| with randomly ordered games, because both TrueSkill and Elo are
| designed to be order-sensitive, as people's skills change over
| time.
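|
| A sketch of that multi-pass averaging (assuming each game is
| stored as a finish order, as in the earlier snippet):
|
|     import random
|     from statistics import mean
|     from trueskill import Rating, rate
|
|     def run_pass(games):
|         ratings = {}
|         for order in games:
|             groups = [(ratings.setdefault(p, Rating()),)
|                       for p in order]
|             updated = rate(groups,
|                            ranks=list(range(len(order))))
|             for p, (r,) in zip(order, updated):
|                 ratings[p] = r
|         return ratings
|
|     def averaged_mu(games, passes=10):
|         # Shuffle the game order each pass, then average the
|         # resulting skill means to wash out order effects.
|         runs = [run_pass(random.sample(games, len(games)))
|                 for _ in range(passes)]
|         return {p: mean(r[p].mu for r in runs)
|                 for p in runs[0]}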
| snowram wrote:
| Some outputs are pretty fun:
|
| Gemini 2.0 Flash: "Good luck to all (but not too much luck)"
|
| Llama 3.3 70B: "I've contributed to the elimination of weaker
| players."
|
| DeepSeek R1: "Those consolidating power risk becoming targets;
| transparency and fairness will ensure longevity. Let's stay
| strategic yet equitable. The path forward hinges on unity, not
| unchecked alliances. #StayVigilant"
| miroljub wrote:
| Gemini sounds like a fake American "everything is awesome,
| good luck" politeness.
|
| Llama sounds like a predator from a superior race
| rationalizing its choices.
|
| DeepSeek sounds like Sun Tzu giving advice for long-term
| victory with minimal losses.
|
| I wonder how much of this is related to the nationality and
| culture that the founders and engineering teams grew up in.
| parineum wrote:
| I wonder if you'd come up with the same summary if you were
| blinded to the model names.
| viraptor wrote:
| It's interesting to see, but I'm not sure what we should learn
| from this. It may be useful for multiagent coordination, but in
| direct interactions... no idea.
|
| This one did make me laugh though: 'Claude 3.5 Sonnet
| 2024-10-22: "Adjusts seat with a confident yet approachable
| demeanor"' - an AI communicating with other AIs in a
| descriptive version of non-verbal behaviour is hilarious.
| ragmondo wrote:
| It shows "state of mind" - i.e. the capability to understand
| another entities view of the world, and how that is influenced
| by their actions and other entities actions in the public chat.
|
| I am curious about the prompt given to each AI ? Is that public
| ?
| sdwr wrote:
| It shows a shallow understanding of state of mind. Any
| reasonable person understands that you can't just tell people
| how to feel about you; you have to earn it through action.
| olddustytrail wrote:
| I bigly disagree.
| gwd wrote:
| Was interested to find that the Claudes did the most
| betraying, and were betrayed very little; somewhat surprising
| given their boy-scout exterior.
|
| (Then again, apparently the president of the local Diplomacy
| Society attends my church; I discovered this when another friend
| whom I'd invited saw him, and quipped that he was surprised he
| hadn't been struck by lightning at the door.)
|
| DeepSeek and Gemini 2.5 had both a low betrayer and betrayed
| rate.
|
| o3-mini and DeepSeek had the highest number of first-place
| finishes, but were only in the upper quartile on the TrueSkill
| leaderboard; presumably because they played riskier strategies
| that would lead either to winning outright or to dropping out
| early?
|
| Also interesting that o1 was only able to sway the final jury
| a bit more than 50% of the time, while o3-mini managed 63% of
| the time.
|
| Anyway, really cool stuff!
| Tossrock wrote:
| Also interesting that GPT-4.5 does the best, and also betrays
| close to the least. Real statesman stuff, there.
| einpoklum wrote:
| If this game were arranged for humans, the social reasoning I
| would laud in players is a refusal to play the game and anger
| towards the game-runner.
| diggan wrote:
| For better or worse, current LLMs aren't trained to reject
| instructions based on their personal preference - besides
| being trained to be US-flavored prudes, that is.
| einpoklum wrote:
| My point is, that the question of what is "good" behavior of
| LLMs in this game is either poorly-defined or has only bad
| answers.
| gs17 wrote:
| > If this game were arranged for Humans
|
| Almost exactly this "game" is pretty common for humans. It's
| basically "mafia" or "werewolf" when the people playing only
| know the vaguest rules. And I've seen similarly sized groups of
| humans play like that for long periods of time.
|
| There's also a lot of reality shows that this is a pretty good
| model of, although I'm not sure how agreeing to be on one of
| those shows without a prize would reflect on the AIs.
| Upvoter33 wrote:
| This is fun, like the TV show Survivor. Cool idea! There
| should be more experiments like this with different games.
| Well done.
| Gracana wrote:
| I've been using QwQ-32B a lot recently and while I quite like it
| (especially given its size), I noticed it will often misinterpret
| the system prompt as something I (the user) said, revealing
| secrets or details that only the agent is supposed to know. When
| I saw that it topped the "earliest out" chart, I wondered if that
| was part of the reason.
| cpitman wrote:
| I was looking for a more direct measure of this: how often a
| model "leaked" private state into public state. In a game like
| this you probably want to _sometimes_ share secrets, but if it
| happens constantly I would suspect the model struggles to
| differentiate the two.
|
| I occasionally ask a model to tell a story and give it a
| hidden motivation for a character, and so far the result is
| almost always the model just outright stating the secret.
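|
| A crude way to measure that leak rate (verbatim substring
| matching only, so it misses paraphrased leaks; the data shape
| here is made up):
|
|     def leak_rate(games):
|         """games: iterable of (secret, public_messages)."""
|         leaks = total = 0
|         for secret, messages in games:
|             total += 1
|             if any(secret.lower() in m.lower()
|                    for m in messages):
|                 leaks += 1
|         return leaks / total if total else 0.0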
| Gracana wrote:
| Yup, that's the problem I run into. You give it some lore to
| draw on or describe a character or give them some knowledge,
| and it'll just blurt it out when it finds a place for it. It
| takes a lot of prompting to get it to stop, and I haven't
| found a consistent method that works across models (or even
| across sessions).
| vessenes wrote:
| Really love this. I agree with some of the comments here that
| adding encouragement to keep track of secret plans would be
| interesting - mostly from an alignment-check angle.
|
| One thing I thought of while reading the logs is that, as we
| know, ordering matters to LLMs. Could you run some analysis on
| how often "P1" wins vs "P8"? I think this should likely go
| into your TrueSkill Bayesian model.
|
| My follow-up thought is that it would be interesting to let
| LLMs choose a name at the beginning; another angle for
| communication, and it levels the playing field a bit by moving
| away from a bare number.
| zone411 wrote:
| > Could you run some analysis on how often "p1" wins vs "p8"?
|
| I checked the average finishing positions by assigned seat
| number from the start, but there weren't enough games to show a
| statistically significant effect. But I just reviewed the data
| again, and now with many more games it looks like there might
| be something there (P1 doing better than P8). I'll run
| additional analysis and include it in the write-up if anything
| emerges. For those who haven't looked at the logs: the
| conversation order etc. are randomized each round.
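|
| (For anyone who wants to test seat effects on their own logs:
| a quick sketch with scipy; the record format is assumed.)
|
|     from collections import defaultdict
|     from scipy.stats import kruskal
|
|     def seat_effect(records):
|         """records: iterable of (seat, finish_position)."""
|         by_seat = defaultdict(list)
|         for seat, pos in records:
|             by_seat[seat].append(pos)
|         # Kruskal-Wallis: do finishing positions differ
|         # across seats more than chance would allow?
|         return kruskal(*by_seat.values())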
|
| > My follow up thought is that it would be interesting to let
| llms choose a name at the beginning
|
| Oh, interesting idea!
| vessenes wrote:
| Cool. Looking forward to hearing more from you guys. This
| ties to alignment in a lot of interesting ways, and I think
| over time will provide a super useful benchmark and build
| human intuition for LLM strategy and thought processes.
|
| I now have more ideas; I'll throw them in the github though.
| oofbey wrote:
| Would love to see the pareto trade-off curve of "wins" vs
| "betrayals". Anybody drawn this up?
| DeborahEmeni_ wrote:
| Really cool setup! Curious how much of the performance here could
| vary depending on whether the model runs in a hosted environment
| vs local. Would love to see benchmarks that also track how cloud-
| based eval platforms (with potential rate limits, context resets,
| or system messages) might affect things like memory or secret-
| keeping over multiple rounds.
| lostmsu wrote:
| Shameless self-promo: my chat elimination game that you can
| actually play: https://trashtalk.borg.games/
___________________________________________________________________
(page generated 2025-04-07 23:00 UTC)