[HN Gopher] Chain of Recursive Thoughts: Make AI think harder by...
       ___________________________________________________________________
        
       Chain of Recursive Thoughts: Make AI think harder by making it
       argue with itself
        
       Author : miles
       Score  : 302 points
       Date   : 2025-04-29 17:19 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | hnuser123456 wrote:
       | I'm having a lot of fun experimenting with stuff like this. I'm
       | trying to put together an unrealengine blueprints style graph
       | editor to allow people to design workflows like this where you
       | start with the user prompt input, which goes to one agent, which
       | makes an initial attempt, and then that conversation history gets
       | passed to another "agent" with a different system prompt telling
       | it to be a harsh critic, but to also give a pass/fail signal, and
       | loop back until the critic judges pass, then send that back to
       | the user as output. Ideally as a little website that can call
       | your own LLM endpoints and save/load/share workflow graphs.
       | 
       | Mistral small 3.1 and gemma 3 feel like the first semi-competent
       | models that can be run locally, but that competence is just a
       | seed, and they still need to be guided with a framework that
       | keeps them on track.
       | 
       | Try giving it python execution in a loop and tell it to explore
       | the world. It'll start trying to download and read news and
       | stuff.
        
         | globalise83 wrote:
         | Have you tried n8n? It allows you to build flows like that -
         | you can run the community version in a Docker container within
         | a few minutes and share the configurations for the flows you
         | have built very easily.
        
           | mecsred wrote:
           | _#_ has to be one of the worst word shortening schemes I've
           | ever seen get widespread. It only works with a very small
           | number of long-lived technologies, in which case they
           | basically just get a nickname, "k8s" "i18n". It does not at
           | all work for larger contexts. You're basically making someone
           | solve a crossword (2 across, 10 letters with two filled in)
           | just to parse your sentence.
        
             | jjj123 wrote:
             | I just googled it and it looks like "n8n" is the name of
             | the service. The op wasn't abbreviating anything so I don't
             | think it's the same phenomenon as what you're describing.
        
               | lgas wrote:
               | Well, the service is doing the same thing though. The
               | part I don't understand is that I assume n8n is short for
               | "Nation" but literally every single person I've seen talk
               | about it on YouTube (which is quite a lot) say "En Eight
               | En" every time.
        
               | nemomarx wrote:
               | nation is too short for 8 - maybe navigation?
        
               | pkaye wrote:
               | Looks like n8n is short for nodemation
        
               | firesteelrain wrote:
               | Why do we do this to ourselves?
        
               | Y_Y wrote:
               | Techno-flagellation is the only way to atone
        
               | oppodeldoc wrote:
               | https://github.com/n8n-io/n8n?tab=readme-ov-file#what-
               | does-n...
        
             | eddieroger wrote:
             | It's just another form of any other jargon - unknown until
             | you know it, and usually specific to the use case. I see
             | k8s and i18n or a11y and I know exactly what they mean
             | because at some point I learned it and it's part of the
             | world I live in. Searching for stuff is how we learn, not
             | solving crosswords.
        
               | mecsred wrote:
               | Right, my complaint is that it only works like jargon,
               | where you are just giving something a context-specific
               | nickname. As a word shortening scheme, it's terrible. A
               | world where many projects have names like s11g is a
               | nightmare.
        
               | wongarsu wrote:
               | I kind of get k8s and can live with i18n (at least it's a
               | long word). But a11y just shouldn't exist. "Oh look, it
               | looks like ally, what a cute play on words". Yeah, but
               | for a dumb joke and 9 saved keystrokes you literally made
               | the word accessibility less accessible. That's exactly
               | the opposite of what accessibility is about
        
           | hnuser123456 wrote:
           | I had not, but that looks awesome. Microsoft put out
           | something called "agent flows" that also fits this
           | category.[1] I'm working on more of an "at home" version - no
           | "talk to sales" button.
           | 
           | https://www.microsoft.com/en-us/microsoft-
           | copilot/blog/copil...
        
         | andai wrote:
         | I am thinking the same thing! Multiple "personalities", in
         | parallel, or in series. For example, I have approximated, in
         | GPT, some of Gemini's ability to call out nonsense, sloppy
         | thinking, by telling GPT to be mean! (The politeness seems to
         | filter out much that is of great value!)
         | 
         | However, the result is not pleasant to read. Gemini solved this
         | in their training, by doing it in two phases... and making the
         | first phase private! ("Thinking.")
         | 
         | So I thought, what I need is a two-phase approach, where that
         | "mean" output gets humanized a little bit. (It gets harsh to
         | work in that way for more than short intervals.)
         | 
         | As a side note, I think there would be great value in a UI that
         | allows a "group chat" of different LLM personalities. I don't
         | know if such a thing exists, but I haven't seen it yet,
         | although the message object format seems to have been designed
         | with it in mind (e.g. every message has a name, to allow for
         | multiple users and multiple AIs).
         | 
         | Even better if it supports multiple providers, since they have
         | different strengths. (It's like getting a second opinion.)
        
           | NitpickLawyer wrote:
           | > As a side note, I think there would be great value in a UI
           | that allows a "group chat" of different LLM personalities.
           | 
           | This is the basic idea behind autogen. They also have a web
           | UI now in autogen studio, it's gotten a bit better. You can
           | create "teams" of agents (with different prompts, themes,
           | tools, etc.) and have them discuss / cooperate. I think they
           | even added memory recently. Have a look at it, might be what
           | you need.
        
           | jbm wrote:
           | I disagree.
           | 
           | If anything, telling GPT to be blunt seems to downgrade its
           | IQ; it hallucinates more and makes statements without
           | considering priors or context. I jokingly call it Reddit
           | mode.
        
             | dingnuts wrote:
             | why would that be a joke? there's a ton of Reddit comments
             | in the training data, and the output is of similar quality.
             | LLMs are literally outputting average Reddit comments.
        
               | inanutshellus wrote:
               | See, he's not joking, he's "joking" ...
        
               | MoonGhost wrote:
               | Reddit works hard to make comments accessible to only
               | Google. However MS + OIA might have grabbed something
               | before Reddit-Google contract.
        
               | jbm wrote:
               | I have hard similar things but I think that's an
               | exaggeration. When I tell GPT o3 or o4-high to assume a
               | professional air, it stops acting like a meat-based AIs
               | on r/politics; specifically, it stops making inane
               | assumptions about the situation and starts becoming
               | useful again.
               | 
               | For example, I had a question from a colleague that made
               | no sense and I was trying to understand it. After feeding
               | the question to GPT 3o, it aggressively told me that I
               | made a major mistake in a quote and I had to make major
               | changes. (It would be OK if this is what the colleague
               | had said, but this wasn't the case). In reality the
               | colleague had misunderstood something about the scope of
               | the project and GPT had picked up on the other person's
               | opinion as the "voice of reason" and just projected what
               | it thought he was saying in a stronger way.
               | 
               | I changed its instructions to "Be direct; but polite,
               | professional and helpful. Make an effort to understand
               | the assumptions underlying your own points and the
               | assumptions made by the user. Offer outside-of-the-box
               | thinking as well if you are being too generic.". The
               | aggro was immediately lost, and it instead it actually
               | tried to clarify what my colleague was saying and being
               | useful again.
               | 
               | I agree with those who say the vanilla version is
               | sycophantic, but the plain talk version has far too many
               | bad habits from the wrong crowd. It's a bit like Monday;
               | lots of aggro, little introspection of assumption.
        
           | theturtletalks wrote:
           | MoE, but an abstraction deeper?
        
         | irthomasthomas wrote:
         | I think you can do most of this already with llm-consortium
         | (maybe needs the llm-openrouter plugin with my pr merging)
         | 
         | A consortium sends the same prompt to multiple models in
         | parallel and the responses are all sent to one arbiter model
         | which judges the model responses. The arbiter decides if more
         | iterations are required. It can also be forced to iterate more
         | until confidence-threshold or min-iterations.
         | 
         | Now, using the pr i made to llm-openrouter, you can save an
         | alias to a model that includes lots of model options. For
         | examples, you can do llm openrouter save -m qwen3 -o online -o
         | temperature 0, system "research prompt" --name qwen-researcher
         | 
         | And now, you can build a consortium where one member is an
         | online research specialist. You could make another uses JSON
         | mode for entity extraction, and a third which writes a blind
         | draft. The arbiter would then make use of all that and
         | synthesize a good answer.
        
           | kridsdale1 wrote:
           | Any links or names of example implementations of this?
        
             | irthomasthomas wrote:
             | https://github.com/irthomasthomas/llm-consortium
             | 
             | also, you aren't limited to cli. When you save a consortium
             | it creates a model. You can then interact with a consortium
             | as if it where a normal model (albeit slower and higher
             | quality). You can then serve your custom models on an
             | openai endpoint and use them with any chat client that
             | supports custom openai endpoints.
             | 
             | The default behaviour is to output just the final
             | synthesis, and this should conform to your user prompt. I
             | recently added the ability to continue conversations with a
             | consortium. In this case it only includes your user prompt
             | and final synthesis in the conversation, so it mimics a
             | normal chat, unlike running multiple iterations in the
             | consortium, where full iteration history and arbiter
             | responses are included.
             | 
             | UV tool install llm
             | 
             | llm install llm-consortium
             | 
             | llm install llm-model-gateway
             | 
             | llm consortium save qwen-gem-sonnet -m qwen3-32b -n 2 -m
             | sonnet-3.7 -m gemini-2.5-pro --arbiter gemini-2.5-flash
             | --confidence-threshold 95 --max-iterations 3
             | 
             | llm serve qwen-gem-sonnet
             | 
             | In this example I used -n 2 on the qwen model since it's so
             | cheap we can include multiple instances of it in a
             | consortium
             | 
             | Gemini flash works well as the arbiter for most prompts.
             | However if your prompt has complex formatting requirements,
             | then embedding that within an already complex consortium
             | prompt often confuses it. In that case use gemini-2.5-pro
             | for the arbiter. .
        
       | Xcelerate wrote:
       | I think this is how we get ML models to come up with novel ideas.
       | Diagonalize against all the ideas they've already tried and
       | dismissed via self-argument but keep certain consistency
       | constraints. (Obviously much easier said than done.)
        
         | andai wrote:
         | What you just said is what I tried and failed to say ten
         | minutes ago!
         | 
         | https://news.ycombinator.com/item?id=43835798
        
           | Nevermark wrote:
           | It's working! Oh, wait ...
           | 
           | These models have limitations obviously, but many critiques
           | apply equally or more to people.
           | 
           | If people were tasked with one shot, 10 second answers, to be
           | written out in near errorless grammar, the LLM's viewing our
           | responses to prompts would be spending a lot of time
           | discussing our limitations and how to game us into better
           | responses. Humor, not at all humor.
        
         | jwally wrote:
         | Scaled up and spread out - this probably gets you pretty close
         | to consciousness(?)
         | 
         | Conway's game of life, but instead of colored squares with
         | rules, they're LLM's with some kind of weighting - all
         | chattering back and forth with one another - bubbling up
         | somehow to cause speach/action
        
           | lubujackson wrote:
           | Decades ago I read The Society of Mind by Marvin Minsky. He
           | pushed this sort of idea, that consciousness is composed of
           | individual, competing processes. Worth a revisit!
        
       | cube2222 wrote:
       | This is really cool!
       | 
       | One strategy I often use (which is much simpler and more limited
       | than this), is to finish my message with: "Please do a round of
       | thinking in <thinking></thinking> tags, then a round of self-
       | critique in <critique></critique> tags, and then a final round of
       | <thinking>, before responding."
       | 
       | It works very well. Similarly just asking it to "find the 5
       | biggest issues with its proposal" works pretty good (the 5
       | forcing it to find _something_ , even if it's mostly irrelevant).
        
         | danielbln wrote:
         | I always do "now again but put on your critical hat"
        
           | CSSer wrote:
           | Makes me wonder how it would do if you tell it "put on your
           | robe and wizard hat"
        
             | tomrod wrote:
             | ChatGPT calls you a superstar and it drops into bruhspeak.
             | Emojis proliferate.
        
             | sumtechguy wrote:
             | it proceeds to spit out the entirety of bash.org
        
         | bentt wrote:
         | Oh I really like that. It makes me want to have it score its
         | ideas with metrics and then keep iterating until it meets some
         | score.
        
         | zoogeny wrote:
         | This is one of the reasons I like the massive context window in
         | Gemini. You can do this as part of the message chain. I don't
         | try to one shot it, just use the same idea across 3 messages.
         | 
         | 1. Figure out a plan (it responds with the plan)
         | 
         | 2. Point out flaws in the plan (it responds with the flaws)
         | 
         | 3. Update the plan to address the flaws (it responds with an up
         | to date plan)
         | 
         | The other things I tend to ask are "what might we be missing?",
         | "what are the [performance|security|legal|cost]
         | considerations?". I can often iterate on the "anything else?"
         | kind of nudging prompts, especially guiding it on topics to
         | consider, for a few messages. After each: update the plan to
         | take those into consideration.
        
       | antisthenes wrote:
       | Cool. Now I can justify talking to myself.
        
       | Garlef wrote:
       | Similarly, letting the LLM generate a socratic dialogue can work
       | pretty well to get deeper into a topic.
        
       | Der_Einzige wrote:
       | Debate as a reasoning tactic is massively undervalued. There's
       | tons of papers on this at places like NeurIPS, ICML, ICLR, etc.
       | 
       | Hell, even a whole quanta article.
       | https://www.quantamagazine.org/debate-may-help-ai-models-con...
       | 
       | I got to meet and talk to the authors of this paper at NeurIPS.
       | They're class acts!
        
       | electroly wrote:
       | This seems to be different than I expected from the title. I
       | thought it would be explicitly adversarial.
       | 
       | 1. You are the assistant. Please answer the question directly.
       | 
       | 2. You are the cross-examiner. The assistant is wrong. Explain
       | why.
       | 
       | 3. You are the assistant. The cross-examiner is wrong. Defend
       | your claim.
       | 
       | 4. You are a judge. Did either party make their case, or is
       | another round of argumentation required?
       | 
       | I haven't tried this. No idea if it works. But I find it's
       | helpful to ask ChatGPT, in separate prompts, "XYZ is true,
       | explain why" and "XYZ is false, explain why" and see which one
       | seems more convincing.
        
         | nonethewiser wrote:
         | Chatgpt shares context between chats. I wonder how that impacts
         | it?
         | 
         | It seems like a good approach though. What you dont want to do
         | is ever suggest that its wrong yourself. Usually it will just
         | assume it is wrong.
         | 
         | Actually what I find impressive is when I do this and it
         | actually pushes back to defend itself.
        
           | the_af wrote:
           | Does it share context even if no "memory updated" message
           | appears indicating it has stored a fact about you?
           | 
           | I asked ChatGPT and it says no, but then again it's not
           | reliable at introspection or at revealing data about how it
           | works.
        
         | 3np wrote:
         | Also a little clickbaity with "my AI" and then it's all
         | Mistral...
        
         | ChadMoran wrote:
         | Check out Fast Agent! (I have no affiliation with it, just use
         | it).
         | 
         | https://github.com/evalstate/fast-agent
        
         | mountainriver wrote:
         | Techniques like this have been around since GPT-3.5. There are
         | boatloads of papers on the topic.
         | 
         | I have no idea why anyone thinks this is novel. I guess that
         | speaks to the state of HN
        
           | moribunda wrote:
           | Exactly... I thought that implementing STORM was just a basic
           | step in this topic... Looks like we're running in circles.
        
       | pkdpic wrote:
       | So glad to see a write up on this finally. I'm no machine
       | learning phd but I always wondered why this wasn't more of a
       | thing. Like an extension of a GAN conceptually, sort of, not
       | really at all Im sure.
       | 
       | Also I think I kind of assumed OpenAI might be doing this behind
       | the curtain?
        
       | K0balt wrote:
       | I'll second this. I often use a "research assistant " and
       | skeptical"department head" personas working together/against each
       | other as a research team. It works well and is occasionally
       | hilarious, replete with the occasional HR complaint when things
       | go off the rails. ( I typically use local uncensored models)
        
       | firgrove wrote:
       | this is amazing - I love seeing novel approaches to optimizing
        
       | joshstrange wrote:
       | I've thought about trying this cross-model as well. Have Claude
       | generate something, have OpenAI check it, have Gemini check that
       | check. Firing multiple of these in parallel.
       | 
       | There was a post here a week or so ago doing the "model checking
       | model"-type thing with GH PRs IIRC that was interesting. I
       | haven't had a chance to play with this idea yet.
        
       | k2xl wrote:
       | I've done something similar for learning about a controversial
       | topic. I ask it to act as if it is called Bob is a well informed
       | supporter of one side (like Ukraine) and then act as if it is
       | something named Alice who is a well informed supporter of another
       | side (Russia) and they have to debate each other over a few
       | prompts with a moderator named 'Sue'
       | 
       | Then after a few rounds of the debate where Sue asks a bunch of
       | questions, I ask it to go to the judges - Mark, Phil, Sarah (and
       | I add a few personalities to each of them... Sometimes I pretend
       | they are famous moral philosophers) and then I have them each
       | come up with a rubric and decide who is the winner.
       | 
       | Really fun, and helps me understand different sides of issues.
        
         | rat87 wrote:
         | That seems like a terrible idea. At best it seems likely to
         | help you make a false but convincing sounding case. I really
         | hope no one is using that to help them understand controversial
         | topics much less using that to determine their stances.
         | 
         | Id recommend looking into actual human experts who are
         | trustworthy and reading them. Trying to get LLM to argue the
         | case will just get you a lot of false information presented in
         | a more convincing fashion
        
       | alexmolas wrote:
       | There are two examples in the repo, one with CoRT and another one
       | without. And the one without it it's much better than the one
       | that uses it. Weird choice of examples...
        
         | 2cheeze4u wrote:
         | I think the names were switched up.
        
       | irthomasthomas wrote:
       | my favourite pattern rn: llm "write a savage, yet grounded roast
       | of: $content" llm -c "Write an equally savage rebuttal" llm -c
       | "first arbitrate and then synthesize a final review."
        
       | getcrunk wrote:
       | Hello cnn's
        
       | m3kw9 wrote:
       | Isn't this best of n?
        
       | jedberg wrote:
       | We're really going to need to figure out how to power all these
       | GPUs with green power real quick, or we're going to melt the
       | planet having AIs debate with themselves on the optimal solution
       | to tik-tac-toe...
        
         | nonethewiser wrote:
         | Ive felt this way when using chatgpt for a simple search. Stuff
         | that google could handle but would just be slower, mostly from
         | me having to manually filter.
         | 
         | Sometimes its the easiest way to complete a very small task but
         | the cost difference on the backend has to be pretty damn large.
         | The user inevitably ends up not caring whatsoever. Its just not
         | real to them.
        
         | ivape wrote:
         | I caught infra people saying that's pretty much the only
         | bottleneck in the data center right now, power and cooling. We
         | know the AI needs to run against itself continuously, and
         | that's just a fact.
        
       | mparnisari wrote:
       | So like rubber ducking for AI?
        
         | 1970-01-01 wrote:
         | "While hallucinating a duck, check my script for errors."
        
         | z2 wrote:
         | I would really like to see a fusion guidebook of mental tricks
         | that work for humans and just as well for AI. Or humorously,
         | perhaps prompt-engineering tricks that are also great mental
         | hacks for better or clearer human thinking.
        
       | WhitneyLand wrote:
       | Why try this idea on base models only?
       | 
       | The whole point of reasoning models is to automatically use COT
       | and related techniques to bring out more capabilities.
       | 
       | It would be interesting to see if this is doing anything that's
       | not already being exploited.
        
       | Lerc wrote:
       | I kind of want to try something like this at a larger scale in an
       | always-on mode where I have a 'senate' of debate. Rather than
       | responding to prompts on a case by case basis, provide a list of
       | tasks (potentially with deadlines) and let the senate work on
       | them, break off into groups to manage subtasks, challenge results
       | , make suggestions. Even potentially a tree of analysts where
       | suggestions only gets passed up the tree when the parent node
       | thinks a lower analysis is particularly insightful.
       | 
       | I definitely think that directing models to approach a problem
       | from a specific perspective can generate better or worse results.
       | Creating a diverse set of perspectives along with critical
       | analysis of their results should be able to produce some
       | impressive results.
       | 
       | Things like this would generate a massive number of tokens, but
       | the cost per token is definitely heading in the right direction
       | to allow for this. There is also the possibility of setting up an
       | AI only IRC server where anybody can connect their own models for
       | a shared debating chamber.
        
         | nonethewiser wrote:
         | In theory couldnt this just be baked into a single adversarial
         | model?
        
           | tonmoy wrote:
           | Yes, but I guess the model is optimized for relatively quick
           | response, whereas these techniques are allowing the model to
           | spend more time to generate a higher quality response
        
           | Lerc wrote:
           | To an extent, but different models are better at different
           | things.
           | 
           | That is something I'm also curious about. Given models (that
           | use the same tokenisation) that are better at different
           | things, would their be interesting things to find by
           | analysing the logprobs for tokens generated from identical
           | inputs (including cross feeding the generated token from one
           | to another)
           | 
           | Surely there must be something notable at particular points
           | when a model goes off on the wrong path.
        
         | mikepurvis wrote:
         | In doing some DevOps-y type tasks recently (ansible, packer,
         | docker, baking images with guestfish), I've found it very
         | frustrating how much ChatGPT will confidently tell me to use
         | flags on tools that don't exist, or hallicinate completely non-
         | existent functions or behaviours. And then when I spend time
         | trying what it suggests only to hit a wall and come back like
         | wtf mate it breezily goes "oh yes so you're right, good job
         | figuring that out! You're so close now! Your next step is to do
         | X and Y," and then serves up the same detailed tutorial as
         | before but with the flag or whatever it was that it had wrong
         | subtly changed.
         | 
         | It definitely makes me feel like I'm dealing with an
         | overenthusiastic intern who is throwing stuff over the wall
         | without checking their work, and like maybe having a second bot
         | sitting in front of the first one being like ARE YOUR SURE
         | ABOUT THAT could really improve things.
        
           | organsnyder wrote:
           | I've enjoyed watching Claude try running commands with
           | incorrect flags, trying them, and then adapting.
        
           | nonelog wrote:
           | Spot on.
        
           | vunderba wrote:
           | 100%. This has happened enough to me that I wished I could
           | just inject the man page docs into it to at least act as a
           | sanity check.
        
           | 0x20cowboy wrote:
           | I did a stint in Devops and I found every models to be like
           | this for all of the infra-as-code languages. Anything yaml
           | based was especially bad.
           | 
           | Even Amazon's own offering completely made things up about
           | Amazon's own formats.
           | 
           | I'd be curious as to why that is. It seems like there would
           | be enough training data, and for Amazon in particular it
           | seems like they could make a validation tool the model could
           | use.
        
             | mikepurvis wrote:
             | Maybe I'm excessively anthropomorphizing, but it does feel
             | a bit analogous to my own thought process, like "I need
             | feature XYZ, and based on other tools I'm more familiar
             | with it should be an --xyz flag, so let me google for that
             | and see if I'm right or if I instead find a four-year-old
             | wontfix on Github where someone asked for what I need and
             | got denied."
             | 
             | Except... the model is missing that final step; instead it
             | just belches out its hypothesis, all dressed up in chirpy,
             | confident-sounding language, certain that I'm moments away
             | from having everything working just perfectly.
        
           | MoonGhost wrote:
           | You can't get more info from LLMs than it actually holds.
           | Like Anthropic pointed if LLMs knows the name but has no
           | other info it starts hallucinating. The same probably happens
           | here. LLM knows there must be a flag but can't remember all
           | of them. Likely short reminder in prompt will help. (or
           | search web for GPT) Just my $0.02.
        
             | mikepurvis wrote:
             | It certainly feels like you can just by challenging it;
             | then it happily finds other paths to what you want. So
             | maybe internally it needs a second voice encouraging it to
             | think harder about alternatives upfront.
        
         | crowcroft wrote:
         | Like, just endlessly grinding tokens, then processing the
         | output and pulling out good ideas when the endless debate
         | generates them?
         | 
         | Would be interesting what it comes up with with enough time and
         | tokens.
        
         | vunderba wrote:
         | A year or so ago I experimented with splitting a user prompt
         | down to a set of "different AI personas" that would each try to
         | approach the user's problem in a different way and then bubble
         | back up with a master arbiter for consensus.
         | 
         | I modeled it after the concept of advisors from Civilization
         | II. It worked reasonably well though I think it was at least
         | somewhat limited by being constrained to a single LLM
         | (Mistral). It also lit my computer on fire.
        
           | bee_rider wrote:
           | What sort of personalities did you try? A group where some
           | members have grudges against each other and will irrationally
           | poke holes in each other's plans could be a fun experiment.
        
             | throwup238 wrote:
             | With multiple groups with external and internal rivalries.
             | The Always Sunny gang versus The IT Crowd.
        
         | danielmarkbruce wrote:
         | This is being done, and you could apply it to a lot of domains.
         | Go for it for whatever use case you have.
        
         | taneq wrote:
         | A society of mind, if you will. :)
         | 
         | This sounds like a fun thing to set up with a quick-enough
         | local model.
        
       | csours wrote:
       | Yes, give the computers anxiety too!
        
       | lenerdenator wrote:
       | I, too, like to give Terminator lite anxiety.
        
       | lepisma wrote:
       | Debates have worked well for me while learning something new:
       | 
       | https://lepisma.xyz/2024/10/19/interventional-debates-for-st...
       | 
       | I believe there is research on this too.
        
       | mritchie712 wrote:
       | Did something similar (OverkiLLM) to this waayyyy back in August
       | with open LLMs. I'm sure it'd work much better now:
       | 
       | https://www.definite.app/blog/overkillm
        
       | daxfohl wrote:
       | Maybe have a "reconcile" option, for it to see if it can mix and
       | match the best parts of each alternative rather than just
       | choosing one.
        
       | grzracz wrote:
       | Your readme demo images are wrong: the terminal one is the non-
       | CoRT one and the GUI one is the one with CoRT. Confused me for a
       | while
        
       | noworriesnate wrote:
       | I've had success telling the model it really needs to poop and if
       | it gets to the point quickly it'll be able to leave the meeting
       | and go do that. It actually works amazingly well.
       | 
       | It's also a lot more ethical than verbal abuse, which some people
       | say improves the results as well.
       | 
       | Programming isn't what it used to be.
        
         | tinix wrote:
         | this works for getting out of traffic tickets too lol
        
       | thunderbong wrote:
       | A lot of the comments here are reminiscent of the early Google
       | days when everyone was finding ways to search better!
        
       | caseyy wrote:
       | I tried something similar when Llama2 came out, pitting two
       | assistants, who each believed the other is the user, against each
       | other. Ultimately, it was the same model talking with itself. The
       | system prompts for both had various instructions to disagree and
       | criticise the opinion of the user. I provided the first message
       | to get things started. Usually, it'd be along the lines of
       | "nuclear proliferation is harmful to humanity".
       | 
       | After 15 or so iterations, both assistants would keep repeating
       | the same things and find agreement anyway. Sometimes, the chat
       | became unhinged and useless, but 95/100 times, it was agreement.
       | 
       | Happy someone else made it work.
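[Editor's sketch] The setup described above, with `ask(system, user)` standing in for a single chat-completion call to the one shared model; the disagree instruction is paraphrased, not the original system prompt.

```python
def self_debate(ask, opening, rounds=15):
    """Two 'assistants' backed by the same model, each told the other
    side's last message came from a user it should disagree with."""
    systems = (
        "You are debater A. Criticise and disagree with the user's claim.",
        "You are debater B. Criticise and disagree with the user's claim.",
    )
    transcript = [opening]
    for turn in range(rounds):
        # Each side only ever sees the opponent's most recent message.
        reply = ask(systems[turn % 2], transcript[-1])
        transcript.append(reply)
    return transcript
```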
        
         | generalizations wrote:
         | I always assumed you'd have to use different models. Even if
         | only one of them is large, the others would inject enough
         | difference of opinion to keep it useful.
        
         | zamalek wrote:
         | This might be a situation that warrants a higher temperature.
         | Actually, it could be worth starting a very high temperature
         | initially and gradually decreasing it.
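[Editor's sketch] One way to realise that schedule is a simple linear anneal, feeding each round's value to the sampling `temperature` parameter of whatever API you call: diverse, disagreeing drafts early, convergent answers late.

```python
def annealed_temperatures(rounds, start=1.5, end=0.2):
    """Linearly decrease sampling temperature across debate rounds."""
    if rounds < 2:
        return [start]
    step = (start - end) / (rounds - 1)
    return [round(start - i * step, 3) for i in range(rounds)]
```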
        
       | throwawayForMe2 wrote:
       | I wonder if the Scholastic method of the Schoolmen would be
       | useful with its argument and counter argument style.
        
       | odo1242 wrote:
       | Something I do sometimes is:
       | 
       | - Have an AI chat model come up with an answer to a problem.
       | 
       | - Have it write a report discussing the details of the problem
       | and why its answer is correct, directed at a person or AI model
       | who has no knowledge of the initial problem or technical field.
       | 
       | - Have a second AI model with no knowledge of the problem grade
       | the report, and write its own report either (a) asking for
       | clarification / more information about the problem that the
       | original model didn't provide or (b) pointing out an
       | inconsistency in the argument posed by the original model. Give
       | this report back to the original model and ask it to write its
       | own report back with either the necessary information or changes.
       | 
       | - Repeat until either the second AI model is convinced by the
       | first AI model's explanation or the first AI model has
       | implemented all the changes requested by the second AI model.
       | 
       | It's super clunky but has given pretty good results in the cases
       | where I tried it lol
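[Editor's sketch] The loop above, with `solver` and `grader` as placeholders for calls to two separate models; the PASS/FAIL prefix convention is an assumption added here to make the verdict parseable.

```python
def critique_loop(solver, grader, problem, max_rounds=5):
    """One model writes a self-contained report; a second model with no
    other context grades it; objections are fed back until the grader
    is convinced or the round budget runs out."""
    report = solver(
        f"Solve this and justify the answer for a reader with no "
        f"knowledge of the problem:\n{problem}"
    )
    for _ in range(max_rounds):
        verdict = grader(
            "Grade this report. Start your reply with PASS or FAIL, "
            f"then list questions or inconsistencies:\n{report}"
        )
        if verdict.startswith("PASS"):
            break
        report = solver(
            f"Revise your report to address these objections:\n{verdict}"
        )
    return report
```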
        
         | hsuduebc2 wrote:
         | Were there any situations where the first conclusion from the
         | AI was completely changed? Can you give general examples of
         | situations where it changed or significantly improved the
         | overall result? It sounds cool.
        
           | nomel wrote:
           | I would be interested to know how often "oscillations" occur,
           | where they flip flop from being too "agreeable" to challenges
           | (which probably is just a sparse latent space). This happens
           | to me pretty frequently, where you can repeatedly say "no
           | that's wrong" and the LLM will do a 180, explaining why it
           | was "in fact" wrong and you are "right", repeat.
        
         | JumpCrisscross wrote:
         | Kagi's Assistant feature makes this super easy. Just switch
         | assistants and ask them to check the other's work.
        
         | StopDisinfo910 wrote:
         | For anything semi-adversarial, I have had good results asking
         | the AI to come up with a plan, then take the side of the
         | opponent, coming up with counter-play to defeat the plan,
         | finally asking for a revision of the initial plan given the
         | potential reaction from the opponent.
         | 
         | The final plan you obtain is generally a lot more well rounded
         | and thought out.
         | 
         | I find that amusing because the technique also works when I
         | apply it to me. Picking flaws in your plan before revisiting it
         | actually works.
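[Editor's sketch] Those three steps, with `ask(system, user)` as a stand-in chat call and the role prompts paraphrased.

```python
def plan_counter_revise(ask, goal):
    """Plan, attack the plan from the opponent's side, then revise."""
    plan = ask("You are a strategist.", f"Draft a plan to achieve: {goal}")
    counter = ask(
        "You are the opponent. Find counter-play that defeats the plan.",
        plan,
    )
    return ask(
        "You are a strategist. Revise the plan to survive this "
        "counter-play.",
        f"Plan:\n{plan}\n\nOpponent's counter-play:\n{counter}",
    )
```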
        
         | itissid wrote:
         | Isn't this another way of saying how inference-time scaling
         | works? It basically produces several chains of thought and
         | then pursues the one with the maximum reward according to an
         | internal function?
        
         | jsight wrote:
         | This reminds me a lot of the YT video that went over using
         | Monte Carlo Tree Search with LLMs to maximize result quality.
         | Link:
         | https://www.youtube.com/watch?v=mfAV_bigdRA&ab_channel=Treli...
         | 
         | It seemed like a pretty good idea, though I'd guess that it
         | would greatly increase token usage. I'd also be concerned that
         | the LLM as a judge might struggle to grade things accurately if
         | it wasn't also able to generate good enough answers to begin
         | with.
        
         | subscribed wrote:
         | I do it all the time in Sillytavern in a group chat - three
         | characters kind of resembling what you just described, and me,
         | participating in the "conversation", them going back and forth
         | until they're satisfied.
         | 
         | With a good model role playing them, works awesome.
        
         | zoogeny wrote:
         | I do the same, and I have one other technique.
         | 
         | I will often have a few chats going for a project, but with
         | different contexts. For example, one might be tech focused,
         | another marketing focused, another with some context on my
         | personal goals, etc.
         | 
         | So I will take the same question and feed it into the chats
         | with differing context. It is almost like having different
         | perspectives on the same problem. And the conclusions can often
         | differ based on the differing contexts.
        
       | ChadMoran wrote:
       | Fast Agent has this as a first-class citizen called "Evaluator
       | Optimizer" pattern. Where it in a loop with a defined number of
       | max refinements judge itself and give the output a rating,
       | demanding it improve it's output.
       | 
       | Highly encourage others to check out Fast Agent. It has been
       | delightful to use. It has interactive chat mode which I love and
       | it's really tight and easy to implement.
       | 
       | https://github.com/evalstate/fast-agent
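[Editor's sketch] The general evaluator-optimizer pattern looks roughly like this; note this is a generic illustration, not Fast Agent's actual API, and the rating labels are assumptions.

```python
def evaluator_optimizer(generate, evaluate, task, max_refinements=3,
                        target="EXCELLENT"):
    """Generate, rate, and demand improvement until the rating hits the
    target or the refinement budget is exhausted."""
    output = generate(task)
    for _ in range(max_refinements):
        rating = evaluate(output)  # e.g. "POOR" / "GOOD" / "EXCELLENT"
        if rating == target:
            break
        output = generate(
            f"{task}\nYour previous answer was rated {rating}. "
            f"Improve it:\n{output}"
        )
    return output
```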
        
       | celltalk wrote:
       | One of my doctoral propositions is that dialogue leads to true
       | artificial intelligence.
        
       | badmonster wrote:
       | Have you experimented with weighting the self-evaluations based
       | on specific criteria (e.g., correctness, clarity, creativity), or
       | using external validators to guide the AI's final choice? Curious
       | how much tuning the evaluation step impacts overall performance.
        
       | j45 wrote:
       | There appears to be no shortage of token-saving attempts that can
       | end up using more tokens, whether it's a monthly paid plan or
       | API.
       | 
       | Having an approach for recognizing what is needed from the AI
       | software, and anticipating how it may default to respond based
       | on its programming, is critical.
        
       | yieldcrv wrote:
       | Reminds me of baby agi from 2 years ago
       | 
       | but I guess that was before chain of thought models
        
       | DyslexicAtheist wrote:
       | > _" I made my AI think"_ ...
       | 
       | utterly moronic.
       | 
       | They don't "think" ... not even in the most autistic sense of the
       | word.
       | 
       | They can generate solutions by combining existing knowledge in
       | unique ways. But they don't "think".
        
       | bilekas wrote:
       | This is an interesting approach, it reminds me of YT creator
       | actually. I'll find the YT creator, but basically he would make
       | some script that would play the game like a race-course, with the
       | goal being the finish line and iterate it N number of times, the
       | script would keep iterating until it found the fastest solution.
       | 
       | I believe they called that machine learning... or reinforcement
       | learning.
       | 
       | I'm being slightly facetious, but my ignorant understanding of AI
       | these days is basically the same no ?
       | 
       | https://www.youtube.com/watch?v=SX08NT55YhA
        
       | cwillu wrote:
       | Any api that lets you constrain output to a formal syntax should
       | let you do away with the "first output a number, and only then
       | explain yourself" boilerplate.
        
       | hu3 wrote:
       | Here's some related challenge I'm facing. Maybe someone can help
       | me:
       | 
       | I also managed to make AI critique itself and that improved code
       | generation a ton.
       | 
       | For a TypeScript backend project that runs with Bun, I tell AI to
       | also generate and run unit tests after every code change
       | suggested by AI.
       | 
       | How do you solve the risk of AI writing and executing unit tests
       | with something like `rm -rf /` and wiping your files?
       | 
       | Docker works but I like to keep things simple.
       | 
       | Deno supports revoking file access but I'd like to keep using
       | Bun.
        
         | zactato wrote:
         | Either you trust the AI or you don't. If you don't trust it,
         | then you need to review what it's writing.
         | 
         | Docker seems like a pretty low complexity way to create an
         | isolated environment to run automation.
        
         | derwiki wrote:
         | Manually approve every terminal command it wants to run instead
         | of vibe mode. Tbh I think an rm -rf scenario is exceedingly
         | unlikely.
        
         | ivape wrote:
         | a) You should only do this in a sandbox
         | 
         | b) You can have the AI run a "firewall" prompt on the final
         | output. So your final output should go through a "You are a
         | firewall that checks for dangerous terminal commands such as
         | <enumerate list of dangerous commands>. If you spot dangerous
         | commands, reform the command so that it is not dangerous"
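[Editor's sketch] A low-ceremony middle ground is to keep Bun but wrap only the test step in a throwaway container. The flags below are standard Docker options and `oven/bun` is the official Bun image; the wrapper function is illustrative.

```python
import subprocess

# AI-written tests run in a container with no network and the project
# mounted read-only, so even an accidental `rm -rf /` only hits the
# disposable container filesystem.
SANDBOX_CMD = [
    "docker", "run", "--rm",
    "--network=none",      # no outbound requests
    "-v", ".:/app:ro",     # source is read-only inside the sandbox
    "-w", "/app",
    "oven/bun", "bun", "test",
]

def run_sandboxed_tests(dry_run=True):
    """Return the command string (dry run) or actually execute it."""
    if dry_run:
        return " ".join(SANDBOX_CMD)
    return subprocess.run(SANDBOX_CMD, capture_output=True, text=True).stdout
```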
        
       | albertgoeswoof wrote:
       | How far is this going to go? Are we going to have a team of AI
       | agents that runs a scrum team and meets for stand ups every
       | couple of hours?
       | 
       | Are we going to replicate government bureaucracy with agents all
       | debating topics all day long to find the best opinion?
        
         | parrit wrote:
         | Maybe. Humans form teams for a reason. Yes there are different
         | experiences and points of view in a human (vs. not so much in
         | an LLM), but sometimes a different hat is all it takes, e.g.
         | code reviewer vs. coder.
        
       | jbellis wrote:
       | does it actually make a difference to do M rounds of N vs one
       | round of M*N?
        
       | asdfman123 wrote:
       | And when I do this people say I'm overanalyzing
        
         | ivape wrote:
         | The thing that makes _us_ weird to regular people is what's
         | going to make us uniquely positioned to utilize AI. If people
         | only _knew_ the level at which I overanalyze and entertain
         | weird ideas. I always inject these personality quirks into my
         | instructions and get very creative results. In a weird way,
         | I'm starting to appreciate just how weird I actually am.
        
       | killerstorm wrote:
       | This is similar to Tree-of-Thought with self-evaluation.
        
       | zekenie wrote:
       | I feel like itd be cool to try prompts based on an adversarial
       | justice system... attorney agents arguing both sides, a judge
       | ruling on "the law"--adherence to instructions etc
        
         | ivape wrote:
         | That's very easy to do. A prompt I regularly use is a "council"
         | system. For example:
         | 
         | "I believe I have been contacted by the supernatural. Here are
         | the details <details>. Please form a council of seven people:
         | 1) Secular scientist 2) Religious scientist 3) Paranormal
         | historian 4) Secular Psychologist 5) Religious psychologist 6)
         | Carl Jung 7) Richard Dawkins. The council should all be
         | independent and provide their own objective analysis. Please
         | have them create a final report and conclusions at the end".
         | 
         | Your council can be anything, a law firm, a jury, a parent
         | teacher association, whatever you want, and as you can see, you
         | can throw in known people as well. This can all be done with
         | one prompt. It's one my favorite things to do.
        
       | parrit wrote:
       | I want to see "Meh" vs. "Holy crap" as a benchmark in a paper
       | published by Google. Or more likely I suspect, Andrej.
        
       | hansmayer wrote:
       | Right, so... but you do realise it's still just producing
       | random output based on how you reconfigured its weights, right?
       | Sometimes it will happen to resonate with what you need. But it
       | is still neither thinking nor arguing with itself.
        
       | stormfather wrote:
       | I made a trading bot that ingested news. The prompt to assess
       | impact was to simulate a debate between Charlie Munger and
       | Warren Buffett on whether to invest.
        
         | internetter wrote:
         | How did it do?
        
       | Svoka wrote:
       | Oh. I was just asking "Use the dialectic method on your
       | solution" at the end of the prompt... It does make it think
       | harder.
        
       | gnarlouse wrote:
       | This seems like low hanging fruit; are we seriously supposed to
       | believe this is new and novel?
        
       | ashoeafoot wrote:
       | Give it reward and punishment evaluations, exploring the noise
       | in parallel, with extinction for the non-rewarding answers?
        
       ___________________________________________________________________
       (page generated 2025-04-29 23:00 UTC)