[HN Gopher] Chain of Recursive Thoughts: Make AI think harder by...
___________________________________________________________________
Chain of Recursive Thoughts: Make AI think harder by making it
argue with itself
Author : miles
Score : 302 points
Date : 2025-04-29 17:19 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| hnuser123456 wrote:
| I'm having a lot of fun experimenting with stuff like this. I'm
| trying to put together an unrealengine blueprints style graph
| editor to allow people to design workflows like this where you
| start with the user prompt input, which goes to one agent, which
| makes an initial attempt, and then that conversation history gets
| passed to another "agent" with a different system prompt telling
| it to be a harsh critic, but to also give a pass/fail signal, and
| loop back until the critic judges pass, then send that back to
| the user as output. Ideally as a little website that can call
| your own LLM endpoints and save/load/share workflow graphs.
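
The worker/critic loop described above can be sketched in a few lines. Here `toy_worker` and `toy_critic` are stand-ins for real LLM calls; any endpoint that returns text would slot in:

```python
def critic_loop(prompt, worker, critic, max_rounds=5):
    """Worker drafts an answer; critic returns PASS or FAIL with notes.
    Loop until the critic passes the draft or we hit max_rounds."""
    draft = worker(prompt, feedback=None)
    for _ in range(max_rounds):
        verdict = critic(prompt, draft)
        if verdict.startswith("PASS"):
            return draft
        draft = worker(prompt, feedback=verdict)
    return draft

# Toy stand-ins so the loop runs end to end; a real setup would
# call your own LLM endpoints with the conversation history here.
def toy_worker(prompt, feedback=None):
    return prompt.upper() if feedback else prompt

def toy_critic(prompt, draft):
    return "PASS" if draft.isupper() else "FAIL: not loud enough"

print(critic_loop("hello world", toy_worker, toy_critic))  # HELLO WORLD
```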
|
| Mistral Small 3.1 and Gemma 3 feel like the first semi-competent
| models that can be run locally, but that competence is just a
| seed, and they still need to be guided with a framework that
| keeps them on track.
|
| Try giving it python execution in a loop and tell it to explore
| the world. It'll start trying to download and read news and
| stuff.
| globalise83 wrote:
| Have you tried n8n? It allows you to build flows like that -
| you can run the community version in a Docker container within
| a few minutes and share the configurations for the flows you
| have built very easily.
| mecsred wrote:
| _#_ has to be one of the worst word shortening schemes I've
| ever seen get widespread. It only works with a very small
| number of long-lived technologies, in which case they
| basically just get a nickname, "k8s" "i18n". It does not at
| all work for larger contexts. You're basically making someone
| solve a crossword (2 across, 10 letters with two filled in)
| just to parse your sentence.
| jjj123 wrote:
| I just googled it and it looks like "n8n" is the name of
| the service. The op wasn't abbreviating anything so I don't
| think it's the same phenomenon as what you're describing.
| lgas wrote:
| Well, the service is doing the same thing though. The
| part I don't understand is that I assume n8n is short for
| "Nation" but literally every single person I've seen talk
| about it on YouTube (which is quite a lot) says "En Eight
| En" every time.
| nemomarx wrote:
| nation is too short for 8 - maybe navigation?
| pkaye wrote:
| Looks like n8n is short for nodemation
| firesteelrain wrote:
| Why do we do this to ourselves?
| Y_Y wrote:
| Techno-flagellation is the only way to atone
| oppodeldoc wrote:
| https://github.com/n8n-io/n8n?tab=readme-ov-file#what-
| does-n...
| eddieroger wrote:
| It's just another form of any other jargon - unknown until
| you know it, and usually specific to the use case. I see
| k8s and i18n or a11y and I know exactly what they mean
| because at some point I learned it and it's part of the
| world I live in. Searching for stuff is how we learn, not
| solving crosswords.
| mecsred wrote:
| Right, my complaint is that it only works like jargon,
| where you are just giving something a context-specific
| nickname. As a word shortening scheme, it's terrible. A
| world where many projects have names like s11g is a
| nightmare.
| wongarsu wrote:
| I kind of get k8s and can live with i18n (at least it's a
| long word). But a11y just shouldn't exist. "Oh look, it
| looks like ally, what a cute play on words". Yeah, but
| for a dumb joke and 9 saved keystrokes you literally made
| the word accessibility less accessible. That's exactly
| the opposite of what accessibility is about
| hnuser123456 wrote:
| I had not, but that looks awesome. Microsoft put out
| something called "agent flows" that also fits this
| category.[1] I'm working on more of an "at home" version - no
| "talk to sales" button.
|
| https://www.microsoft.com/en-us/microsoft-
| copilot/blog/copil...
| andai wrote:
| I am thinking the same thing! Multiple "personalities", in
| parallel, or in series. For example, I have approximated, in
| GPT, some of Gemini's ability to call out nonsense, sloppy
| thinking, by telling GPT to be mean! (The politeness seems to
| filter out much that is of great value!)
|
| However, the result is not pleasant to read. Gemini solved this
| in their training, by doing it in two phases... and making the
| first phase private! ("Thinking.")
|
| So I thought, what I need is a two-phase approach, where that
| "mean" output gets humanized a little bit. (It gets harsh to
| work in that way for more than short intervals.)
|
| As a side note, I think there would be great value in a UI that
| allows a "group chat" of different LLM personalities. I don't
| know if such a thing exists, but I haven't seen it yet,
| although the message object format seems to have been designed
| with it in mind (e.g. every message has a name, to allow for
| multiple users and multiple AIs).
|
| Even better if it supports multiple providers, since they have
| different strengths. (It's like getting a second opinion.)
| NitpickLawyer wrote:
| > As a side note, I think there would be great value in a UI
| that allows a "group chat" of different LLM personalities.
|
| This is the basic idea behind autogen. They also have a web
| UI now in autogen studio, it's gotten a bit better. You can
| create "teams" of agents (with different prompts, themes,
| tools, etc.) and have them discuss / cooperate. I think they
| even added memory recently. Have a look at it, might be what
| you need.
| jbm wrote:
| I disagree.
|
| If anything, telling GPT to be blunt seems to downgrade its
| IQ; it hallucinates more and makes statements without
| considering priors or context. I jokingly call it Reddit
| mode.
| dingnuts wrote:
| why would that be a joke? there's a ton of Reddit comments
| in the training data, and the output is of similar quality.
| LLMs are literally outputting average Reddit comments.
| inanutshellus wrote:
| See, he's not joking, he's "joking" ...
| MoonGhost wrote:
| Reddit works hard to make comments accessible to only
| Google. However MS + OAI might have grabbed something
| before the Reddit-Google contract.
| jbm wrote:
| I have heard similar things but I think that's an
| exaggeration. When I tell GPT o3 or o4-high to assume a
| professional air, it stops acting like the meat-based AIs
| on r/politics; specifically, it stops making inane
| assumptions about the situation and starts becoming
| useful again.
|
| For example, I had a question from a colleague that made
| no sense and I was trying to understand it. After feeding
| the question to GPT o3, it aggressively told me that I
| made a major mistake in a quote and I had to make major
| changes. (It would be OK if this is what the colleague
| had said, but this wasn't the case). In reality the
| colleague had misunderstood something about the scope of
| the project and GPT had picked up on the other person's
| opinion as the "voice of reason" and just projected what
| it thought he was saying in a stronger way.
|
| I changed its instructions to "Be direct; but polite,
| professional and helpful. Make an effort to understand
| the assumptions underlying your own points and the
| assumptions made by the user. Offer outside-of-the-box
| thinking as well if you are being too generic.". The
| aggro was immediately lost, and instead it actually
| tried to clarify what my colleague was saying and became
| useful again.
|
| I agree with those who say the vanilla version is
| sycophantic, but the plain talk version has far too many
| bad habits from the wrong crowd. It's a bit like Monday;
| lots of aggro, little introspection of assumption.
| theturtletalks wrote:
| MoE, but an abstraction deeper?
| irthomasthomas wrote:
| I think you can do most of this already with llm-consortium
| (maybe needs the llm-openrouter plugin with my pr merging)
|
| A consortium sends the same prompt to multiple models in
| parallel and the responses are all sent to one arbiter model
| which judges the model responses. The arbiter decides if more
| iterations are required. It can also be forced to iterate more
| until confidence-threshold or min-iterations.
|
| Now, using the pr i made to llm-openrouter, you can save an
| alias to a model that includes lots of model options. For
| example, you can do llm openrouter save -m qwen3 -o online -o
| temperature 0, system "research prompt" --name qwen-researcher
|
| And now, you can build a consortium where one member is an
| online research specialist. You could make another use JSON
| mode for entity extraction, and a third which writes a blind
| draft. The arbiter would then make use of all that and
| synthesize a good answer.
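
As a rough sketch of the fan-out/arbiter idea (not the actual llm-consortium internals), the loop might look like this, with toy member and arbiter functions in place of real model calls:

```python
def consortium(prompt, members, arbiter, threshold=95, max_iters=3):
    """Send the same prompt to all members in each round; the arbiter
    synthesizes the answers and scores its own confidence, iterating
    until the confidence threshold or max_iters is reached."""
    synthesis, confidence = None, 0
    history = []
    for _ in range(max_iters):
        answers = [member(prompt, history) for member in members]
        synthesis, confidence = arbiter(prompt, answers)
        history.append(synthesis)
        if confidence >= threshold:
            break
    return synthesis, confidence

# Toy members/arbiter so the loop is runnable; real ones would call models.
def make_member(tag):
    return lambda prompt, history: f"{tag}: answer to {prompt}"

def toy_arbiter(prompt, answers):
    return answers[0], 90 + 5 * len(answers)  # toy confidence heuristic

result, conf = consortium("q", [make_member("a"), make_member("b")], toy_arbiter)
```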
| kridsdale1 wrote:
| Any links or names of example implementations of this?
| irthomasthomas wrote:
| https://github.com/irthomasthomas/llm-consortium
|
| also, you aren't limited to cli. When you save a consortium
| it creates a model. You can then interact with a consortium
| as if it were a normal model (albeit slower and higher
| quality). You can then serve your custom models on an
| openai endpoint and use them with any chat client that
| supports custom openai endpoints.
|
| The default behaviour is to output just the final
| synthesis, and this should conform to your user prompt. I
| recently added the ability to continue conversations with a
| consortium. In this case it only includes your user prompt
| and final synthesis in the conversation, so it mimics a
| normal chat, unlike running multiple iterations in the
| consortium, where full iteration history and arbiter
| responses are included.
|
| uv tool install llm
|
| llm install llm-consortium
|
| llm install llm-model-gateway
|
| llm consortium save qwen-gem-sonnet -m qwen3-32b -n 2 -m
| sonnet-3.7 -m gemini-2.5-pro --arbiter gemini-2.5-flash
| --confidence-threshold 95 --max-iterations 3
|
| llm serve qwen-gem-sonnet
|
| In this example I used -n 2 on the qwen model since it's so
| cheap we can include multiple instances of it in a
| consortium
|
| Gemini flash works well as the arbiter for most prompts.
| However if your prompt has complex formatting requirements,
| then embedding that within an already complex consortium
| prompt often confuses it. In that case use gemini-2.5-pro
| for the arbiter.
| Xcelerate wrote:
| I think this is how we get ML models to come up with novel ideas.
| Diagonalize against all the ideas they've already tried and
| dismissed via self-argument but keep certain consistency
| constraints. (Obviously much easier said than done.)
| andai wrote:
| What you just said is what I tried and failed to say ten
| minutes ago!
|
| https://news.ycombinator.com/item?id=43835798
| Nevermark wrote:
| It's working! Oh, wait ...
|
| These models have limitations obviously, but many critiques
| apply equally or more to people.
|
| If people were tasked with one shot, 10 second answers, to be
| written out in near errorless grammar, the LLMs viewing our
| responses to prompts would be spending a lot of time
| discussing our limitations and how to game us into better
| responses. Humor, not at all humor.
| jwally wrote:
| Scaled up and spread out - this probably gets you pretty close
| to consciousness(?)
|
| Conway's game of life, but instead of colored squares with
| rules, they're LLM's with some kind of weighting - all
| chattering back and forth with one another - bubbling up
| somehow to cause speech/action
| lubujackson wrote:
| Decades ago I read The Society of Mind by Marvin Minsky. He
| pushed this sort of idea, that consciousness is composed of
| individual, competing processes. Worth a revisit!
| cube2222 wrote:
| This is really cool!
|
| One strategy I often use (which is much simpler and more limited
| than this), is to finish my message with: "Please do a round of
| thinking in <thinking></thinking> tags, then a round of self-
| critique in <critique></critique> tags, and then a final round of
| <thinking>, before responding."
|
| It works very well. Similarly just asking it to "find the 5
| biggest issues with its proposal" works pretty well (the 5
| forcing it to find _something_, even if it's mostly irrelevant).
| danielbln wrote:
| I always do "now again but put on your critical hat"
| CSSer wrote:
| Makes me wonder how it would do if you tell it "put on your
| robe and wizard hat"
| tomrod wrote:
| ChatGPT calls you a superstar and it drops into bruhspeak.
| Emojis proliferate.
| sumtechguy wrote:
| it proceeds to spit out the entirety of bash.org
| bentt wrote:
| Oh I really like that. It makes me want to have it score its
| ideas with metrics and then keep iterating until it meets some
| score.
| zoogeny wrote:
| This is one of the reasons I like the massive context window in
| Gemini. You can do this as part of the message chain. I don't
| try to one shot it, just use the same idea across 3 messages.
|
| 1. Figure out a plan (it responds with the plan)
|
| 2. Point out flaws in the plan (it responds with the flaws)
|
| 3. Update the plan to address the flaws (it responds with an up
| to date plan)
|
| The other things I tend to ask are "what might we be missing?",
| "what are the [performance|security|legal|cost]
| considerations?". I can often iterate on the "anything else?"
| kind of nudging prompts, especially guiding it on topics to
| consider, for a few messages. After each: update the plan to
| take those into consideration.
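
The three-message pattern above is easy to script against any chat-style API; `toy_chat` here is a placeholder for a real endpoint call:

```python
STEPS = [
    "Figure out a plan for: {task}",
    "Point out flaws in the plan above.",
    "Update the plan to address those flaws.",
]

def refine(task, chat):
    """Run the plan -> flaws -> updated-plan sequence in one message
    chain. `chat` takes the running message list and returns a reply."""
    messages = []
    for step in STEPS:
        messages.append({"role": "user", "content": step.format(task=task)})
        messages.append({"role": "assistant", "content": chat(messages)})
    return messages[-1]["content"]

# Placeholder chat function; a real one would call an LLM endpoint.
def toy_chat(messages):
    return f"reply-{len(messages) // 2 + 1}"

print(refine("ship the feature", toy_chat))  # reply-3
```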
| antisthenes wrote:
| Cool. Now I can justify talking to myself.
| Garlef wrote:
| Similarly, letting the LLM generate a socratic dialogue can work
| pretty well to get deeper into a topic.
| Der_Einzige wrote:
| Debate as a reasoning tactic is massively undervalued. There's
| tons of papers on this at places like NeurIPS, ICML, ICLR, etc.
|
| Hell, even a whole quanta article.
| https://www.quantamagazine.org/debate-may-help-ai-models-con...
|
| I got to meet and talk to the authors of this paper at NeurIPS.
| They're class acts!
| electroly wrote:
| This seems different from what I expected from the title. I
| thought it would be explicitly adversarial.
|
| 1. You are the assistant. Please answer the question directly.
|
| 2. You are the cross-examiner. The assistant is wrong. Explain
| why.
|
| 3. You are the assistant. The cross-examiner is wrong. Defend
| your claim.
|
| 4. You are a judge. Did either party make their case, or is
| another round of argumentation required?
|
| I haven't tried this. No idea if it works. But I find it's
| helpful to ask ChatGPT, in separate prompts, "XYZ is true,
| explain why" and "XYZ is false, explain why" and see which one
| seems more convincing.
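
A minimal sketch of that explicitly adversarial round structure (untried against real models, as the comment says; `toy_ask` stands in for a single LLM call):

```python
ROLES = {
    "assistant": "You are the assistant. Answer the question directly.",
    "examiner": "You are the cross-examiner. The assistant is wrong. Explain why.",
    "judge": ("You are a judge. Did either party make their case, "
              "or is another round of argumentation required?"),
}

def debate(question, ask, max_rounds=3):
    """Alternate assistant and cross-examiner turns until the judge
    is satisfied. `ask(system, transcript)` is one model call."""
    transcript = [("question", question)]
    verdict = ""
    for _ in range(max_rounds):
        transcript.append(("assistant", ask(ROLES["assistant"], transcript)))
        transcript.append(("examiner", ask(ROLES["examiner"], transcript)))
        verdict = ask(ROLES["judge"], transcript)
        if "another round" not in verdict.lower():
            break
    return transcript, verdict

# Toy stand-in for a model call.
def toy_ask(system, transcript):
    if system.startswith("You are a judge"):
        return "Case made."
    return f"turn-{len(transcript)}"
```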
| nonethewiser wrote:
| Chatgpt shares context between chats. I wonder how that impacts
| it?
|
| It seems like a good approach though. What you dont want to do
| is ever suggest that its wrong yourself. Usually it will just
| assume it is wrong.
|
| Actually what I find impressive is when I do this and it
| actually pushes back to defend itself.
| the_af wrote:
| Does it share context even if no "memory updated" message
| appears indicating it has stored a fact about you?
|
| I asked ChatGPT and it says no, but then again it's not
| reliable at introspection or at revealing data about how it
| works.
| 3np wrote:
| Also a little clickbaity with "my AI" and then it's all
| Mistral...
| ChadMoran wrote:
| Check out Fast Agent! (I have no affiliation with it, just use
| it).
|
| https://github.com/evalstate/fast-agent
| mountainriver wrote:
| Techniques like this have been around since GPT-3.5. There are
| boatloads of papers on the topic.
|
| I have no idea why anyone thinks this is novel. I guess that
| speaks to the state of HN
| moribunda wrote:
| Exactly... I thought that implementing STORM was just a basic
| step in this topic... Looks like we're running in circles.
| pkdpic wrote:
| So glad to see a write-up on this finally. I'm no machine
| learning PhD, but I always wondered why this wasn't more of a
| thing. Like an extension of a GAN conceptually, sort of, not
| really at all, I'm sure.
|
| Also I think I kind of assumed OpenAI might be doing this behind
| the curtain?
| K0balt wrote:
| I'll second this. I often use "research assistant" and
| skeptical "department head" personas working together/against each
| other as a research team. It works well and is occasionally
| hilarious, replete with the occasional HR complaint when things
| go off the rails. ( I typically use local uncensored models)
| firgrove wrote:
| this is amazing - I love seeing novel approaches to optimizing
| joshstrange wrote:
| I've thought about trying this cross-model as well. Have Claude
| generate something, have OpenAI check it, have Gemini check that
| check. Firing multiple of these in parallel.
|
| There was a post here a week or so ago doing the "model checking
| model"-type thing with GH PRs IIRC that was interesting. I
| haven't had a chance to play with this idea yet.
| k2xl wrote:
| I've done something similar for learning about a controversial
| topic. I ask it to act as someone called Bob, a well-informed
| supporter of one side (like Ukraine), and then act as
| someone named Alice, a well-informed supporter of the other
| side (Russia), and they have to debate each other over a few
| prompts with a moderator named 'Sue'.
|
| Then after a few rounds of the debate where Sue asks a bunch of
| questions, I ask it to go to the judges - Mark, Phil, Sarah (and
| I add a few personalities to each of them... Sometimes I pretend
| they are famous moral philosophers) and then I have them each
| come up with a rubric and decide who is the winner.
|
| Really fun, and helps me understand different sides of issues.
| rat87 wrote:
| That seems like a terrible idea. At best it seems likely to
| help you make a false but convincing-sounding case. I really
| hope no one is using that to help them understand controversial
| topics much less using that to determine their stances.
|
| I'd recommend looking into actual human experts who are
| trustworthy and reading them. Trying to get LLM to argue the
| case will just get you a lot of false information presented in
| a more convincing fashion
| alexmolas wrote:
| There are two examples in the repo, one with CoRT and another one
| without. And the one without it is much better than the one
| that uses it. Weird choice of examples...
| 2cheeze4u wrote:
| I think the names were switched up.
| irthomasthomas wrote:
| my favourite pattern rn: llm "write a savage, yet grounded roast
| of: $content" llm -c "Write an equally savage rebuttal" llm -c
| "first arbitrate and then synthesize a final review."
| getcrunk wrote:
| Hello cnn's
| m3kw9 wrote:
| Isn't this best-of-n?
| jedberg wrote:
| We're really going to need to figure out how to power all these
| GPUs with green power real quick, or we're going to melt the
| planet having AIs debate with themselves on the optimal solution
| to tik-tac-toe...
| nonethewiser wrote:
| I've felt this way when using ChatGPT for a simple search. Stuff
| that google could handle but would just be slower, mostly from
| me having to manually filter.
|
| Sometimes it's the easiest way to complete a very small task, but
| the cost difference on the backend has to be pretty damn large.
| The user inevitably ends up not caring whatsoever. It's just not
| real to them.
| ivape wrote:
| I caught infra people saying that's pretty much the only
| bottleneck in the data center right now, power and cooling. We
| know the AI needs to run against itself continuously, and
| that's just a fact.
| mparnisari wrote:
| So like rubber ducking for AI?
| 1970-01-01 wrote:
| "While hallucinating a duck, check my script for errors."
| z2 wrote:
| I would really like to see a fusion guidebook of mental tricks
| that work for humans and just as well for AI. Or humorously,
| perhaps prompt-engineering tricks that are also great mental
| hacks for better or clearer human thinking.
| WhitneyLand wrote:
| Why try this idea on base models only?
|
| The whole point of reasoning models is to automatically use COT
| and related techniques to bring out more capabilities.
|
| It would be interesting to see if this is doing anything that's
| not already being exploited.
| Lerc wrote:
| I kind of want to try something like this at a larger scale in an
| always-on mode where I have a 'senate' of debate. Rather than
| responding to prompts on a case by case basis, provide a list of
| tasks (potentially with deadlines) and let the senate work on
| them, break off into groups to manage subtasks, challenge results,
| make suggestions. Even potentially a tree of analysts where
| suggestions only get passed up the tree when the parent node
| thinks a lower analysis is particularly insightful.
|
| I definitely think that directing models to approach a problem
| from a specific perspective can generate better or worse results.
| Creating a diverse set of perspectives along with critical
| analysis of their results should be able to produce some
| impressive results.
|
| Things like this would generate a massive number of tokens, but
| the cost per token is definitely heading in the right direction
| to allow for this. There is also the possibility of setting up an
| AI only IRC server where anybody can connect their own models for
| a shared debating chamber.
| nonethewiser wrote:
| In theory couldnt this just be baked into a single adversarial
| model?
| tonmoy wrote:
| Yes, but I guess the model is optimized for relatively quick
| response, whereas these techniques are allowing the model to
| spend more time to generate a higher quality response
| Lerc wrote:
| To an extent, but different models are better at different
| things.
|
| That is something I'm also curious about. Given models (that
| use the same tokenisation) that are better at different
| things, would there be interesting things to find by
| analysing the logprobs for tokens generated from identical
| inputs (including cross feeding the generated token from one
| to another)
|
| Surely there must be something notable at particular points
| when a model goes off on the wrong path.
| mikepurvis wrote:
| In doing some DevOps-y type tasks recently (ansible, packer,
| docker, baking images with guestfish), I've found it very
| frustrating how much ChatGPT will confidently tell me to use
| flags on tools that don't exist, or hallucinate completely
| non-existent functions or behaviours. And then when I spend time
| trying what it suggests only to hit a wall and come back like
| wtf mate it breezily goes "oh yes so you're right, good job
| figuring that out! You're so close now! Your next step is to do
| X and Y," and then serves up the same detailed tutorial as
| before but with the flag or whatever it was that it had wrong
| subtly changed.
|
| It definitely makes me feel like I'm dealing with an
| overenthusiastic intern who is throwing stuff over the wall
| without checking their work, and like maybe having a second bot
| sitting in front of the first one being like ARE YOU SURE
| ABOUT THAT could really improve things.
| organsnyder wrote:
| I've enjoyed watching Claude try running commands with
| incorrect flags, trying them, and then adapting.
| nonelog wrote:
| Spot on.
| vunderba wrote:
| 100%. This has happened enough to me that I wished I could
| just inject the man page docs into it to at least act as a
| sanity check.
| 0x20cowboy wrote:
| I did a stint in DevOps and I found every model to be like
| this for all of the infra-as-code languages. Anything
| YAML-based was especially bad.
|
| Even Amazon's own offering completely made things up about
| Amazon's own formats.
|
| I'd be curious as to why that is. It seems like there would
| be enough training data, and for Amazon in particular it
| seems like they could make a validation tool the model could
| use.
| mikepurvis wrote:
| Maybe I'm excessively anthropomorphizing, but it does feel
| a bit analogous to my own thought process, like "I need
| feature XYZ, and based on other tools I'm more familiar
| with it should be an --xyz flag, so let me google for that
| and see if I'm right or if I instead find a four-year-old
| wontfix on Github where someone asked for what I need and
| got denied."
|
| Except... the model is missing that final step; instead it
| just belches out its hypothesis, all dressed up in chirpy,
| confident-sounding language, certain that I'm moments away
| from having everything working just perfectly.
| MoonGhost wrote:
| You can't get more info out of an LLM than it actually holds.
| As Anthropic pointed out, if an LLM knows the name but has no
| other info it starts hallucinating. The same probably happens
| here. The LLM knows there must be a flag but can't remember all
| of them. Likely a short reminder in the prompt will help (or
| web search for GPT). Just my $0.02.
| mikepurvis wrote:
| It certainly feels like you can just by challenging it;
| then it happily finds other paths to what you want. So
| maybe internally it needs a second voice encouraging it to
| think harder about alternatives upfront.
| crowcroft wrote:
| Like, just endlessly grinding tokens, then processing the
| output and pulling out good ideas when the endless debate
| generates them?
|
| Would be interesting what it comes up with with enough time and
| tokens.
| vunderba wrote:
| A year or so ago I experimented with splitting a user prompt
| down to a set of "different AI personas" that would each try to
| approach the user's problem in a different way and then bubble
| back up with a master arbiter for consensus.
|
| I modeled it after the concept of advisors from Civilization
| II. It worked reasonably well though I think it was at least
| somewhat limited by being constrained to a single LLM
| (Mistral). It also lit my computer on fire.
| bee_rider wrote:
| What sort of personalities did you try? A group where some
| members have grudges against each other and will irrationally
| poke holes in each other's plans could be a fun experiment.
| throwup238 wrote:
| With multiple groups with external and internal rivalries.
| The Always Sunny gang versus The IT Crowd.
| danielmarkbruce wrote:
| This is being done, and you could apply it to a lot of domains.
| Go for it for whatever use case you have.
| taneq wrote:
| A society of mind, if you will. :)
|
| This sounds like a fun thing to set up with a quick-enough
| local model.
| csours wrote:
| Yes, give the computers anxiety too!
| lenerdenator wrote:
| I, too, like to give Terminator lite anxiety.
| lepisma wrote:
| Debates have worked well for me while learning something new:
|
| https://lepisma.xyz/2024/10/19/interventional-debates-for-st...
|
| I believe there is research on this too.
| mritchie712 wrote:
| Did something similar (OverkiLLM) to this waayyyy back in August
| with open LLMs. I'm sure it'd work much better now:
|
| https://www.definite.app/blog/overkillm
| daxfohl wrote:
| Maybe have a "reconcile" option, for it to see if it can mix and
| match the best parts of each alternative rather than just
| choosing one.
| grzracz wrote:
| Your readme demo images are wrong: the terminal one is the non-
| CoRT one and the GUI one is the one with CoRT. Confused me for a
| while
| noworriesnate wrote:
| I've had success telling the model it really needs to poop and if
| it gets to the point quickly it'll be able to leave the meeting
| and go do that. It actually works amazingly well.
|
| It's also a lot more ethical than verbal abuse, which some people
| say improves the results as well.
|
| Programming isn't what it used to be.
| tinix wrote:
| this works for getting out of traffic tickets too lol
| thunderbong wrote:
| A lot of the comments here are reminiscent of the early Google
| days when everyone was finding ways to search better!
| caseyy wrote:
| I tried something similar when Llama2 came out, pitting two
| assistants, who each believed the other is the user, against each
| other. Ultimately, it was the same model talking with itself. The
| system prompts for both had various instructions to disagree and
| criticise the opinion of the user. I provided the first message
| to get things started. Usually, it'd be along the lines of
| "nuclear proliferation is harmful to humanity".
|
| After 15 or so iterations, both assistants would keep repeating
| the same things and find agreement anyway. Sometimes, the chat
| became unhinged and useless, but 95/100 times, it was agreement.
|
| Happy someone else made it work.
| generalizations wrote:
| I always assumed you'd have to use different models. Even if
| only one of them is large, the others would inject enough
| difference of opinion to keep it useful.
| zamalek wrote:
| This might be a situation that warrants a higher temperature.
| Actually, it could be worth starting with a very high
| temperature initially and gradually decreasing it.
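
A simple linear annealing schedule for the sampling temperature could look like this (the start/end values are arbitrary placeholders):

```python
def temperature_schedule(start=1.5, end=0.2, steps=6):
    """Linearly anneal the sampling temperature from an exploratory
    start value down to a focused end value over `steps` iterations."""
    if steps == 1:
        return [start]
    delta = (start - end) / (steps - 1)
    return [round(start - i * delta, 3) for i in range(steps)]

print(temperature_schedule())  # [1.5, 1.24, 0.98, 0.72, 0.46, 0.2]
```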
| throwawayForMe2 wrote:
| I wonder if the Scholastic method of the Schoolmen would be
| useful with its argument and counter argument style.
| odo1242 wrote:
| Something I do sometimes is:
|
| - Have an AI chat model come up with an answer to a problem.
|
| - Have it write a report discussing the details of the problem
| and why its answer is correct, directed at a person or AI model
| who has no knowledge of the initial problem or technical field.
|
| - Have a second AI model with no knowledge of the problem grade
| the report, and write its own report either (a) asking for
| clarification / more information about the problem that the
| original model didn't provide or (b) pointing out an
| inconsistency in the argument posed by the original model. Give
| this report back to the original model and ask it to write its
| own report back with either the necessary information or changes.
|
| - Repeat until either the second AI model is convinced by the
| first AI model's explanation or the first AI model has
| implemented all the changes requested by the second AI model.
|
| It's super clunky but has given pretty good results in the cases
| where I tried it lol
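
The report/grade loop above can be sketched like so, with toy functions in place of the two models:

```python
def report_loop(problem, explainer, grader, max_rounds=4):
    """Explainer writes a report; a grader with no knowledge of the
    problem either approves it or sends back questions/objections,
    which the explainer then addresses in a revised report."""
    report = explainer(problem, feedback=None)
    for _ in range(max_rounds):
        review = grader(report)
        if review == "CONVINCED":
            break
        report = explainer(problem, feedback=review)
    return report

# Toy stand-ins; real ones would be two separate model contexts.
def toy_explainer(problem, feedback=None):
    return problem + (" + details" if feedback else "")

def toy_grader(report):
    return "CONVINCED" if "details" in report else "needs details"

print(report_loop("why X", toy_explainer, toy_grader))  # why X + details
```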
| hsuduebc2 wrote:
| Were there any situations where the first conclusion from the
| AI was completely changed? Can you give general examples of
| situations where it changed or significantly improved the
| overall result? It sounds cool.
| nomel wrote:
| I would be interested to know how often "oscillations" occur,
| where they flip flop from being too "agreeable" to challenges
| (which probably is just a sparse latent space). This happens
| to me pretty frequently, where you can repeatedly say "no
| that's wrong" and the LLM will do a 180, explaining why it
| was "in fact" wrong and you are "right", repeat.
| JumpCrisscross wrote:
| Kagi's Assistant feature makes this super easy. Just switch
| assistants and ask them to check the other's work.
| StopDisinfo910 wrote:
| For anything semi-adversarial, I have had good results asking
| the AI to come up with a plan, then take the side of the
| opponent coming with counter play/way to defeat the plan,
| finally asking for a revision of the initial plan given the
| potential reaction from the opponent.
|
| The final plan you obtain is generally a lot more well rounded
| and thought out.
|
| I find that amusing because the technique also works when I
| apply it to me. Picking flaws in your plan before revisiting it
| actually works.
| itissid wrote:
| Isn't this kind of another way of how Inference Time Scaling
| works? It will basically produce several chain of thoughts and
| then pursue one that has maximum reward based on an internal
| function?
| jsight wrote:
| This reminds me a lot of the YT video that went over using
| Monte Carlo Tree Search with LLMs to maximize result quality.
| Link:
| https://www.youtube.com/watch?v=mfAV_bigdRA&ab_channel=Treli...
|
| It seemed like a pretty good idea, though I'd guess that it
| would greatly increase token usage. I'd also be concerned that
| the LLM as a judge might struggle to grade things accurately if
| it wasn't also able to generate good enough answers to begin
| with.
| subscribed wrote:
| I do it all the time in Sillytavern in a group chat - three
| characters kind of resembling what you just described, and me,
| participating in the "conversation", them going back and forth
| until they're satisfied.
|
| With a good model role playing them, works awesome.
| zoogeny wrote:
| I do the same, and I have one other technique.
|
| I will often have a few chats going for a project, but with
| different contexts. For example, one might be tech focused,
| another marketing focused, another with some context on my
| personal goals, etc.
|
| So I will take the same question and feed it into the chats
| with differing context. It is almost like having different
| perspectives on the same problem. And the conclusions can often
| differ based on the differing contexts.
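|
| Mechanically, that's just the same question fanned out over
| differently-primed chats. A sketch, with made-up context
| prompts and an `ask(system, question)` placeholder:

```python
# Same question, different contexts: each entry pairs a system
# prompt (hypothetical examples) with the shared question.
CONTEXTS = {
    "tech": "You are a pragmatic software architect.",
    "marketing": "You are a growth-focused marketer.",
    "personal": "You advise on the author's personal goals.",
}

def multi_perspective(question, ask):
    # One answer per context; compare/contrast them afterwards.
    return {
        name: ask(system, question)
        for name, system in CONTEXTS.items()
    }
```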
| ChadMoran wrote:
| Fast Agent has this as a first-class citizen, the "Evaluator
| Optimizer" pattern: in a loop with a defined maximum number of
| refinements, it judges itself, gives the output a rating, and
| demands that it improve its output.
|
| Highly encourage others to check out Fast Agent. It has been
| delightful to use. It has interactive chat mode which I love and
| it's really tight and easy to implement.
|
| https://github.com/evalstate/fast-agent
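|
| The shape of that pattern (a hedged sketch of the general
| evaluator-optimizer idea, not Fast Agent's actual API):

```python
# Generate, rate 1-10, refine until the rating clears a threshold
# or the refinement budget runs out. All three callables are
# placeholders for model calls.
def evaluate_optimize(task, generate, rate, refine,
                      threshold=8, max_refinements=3):
    draft = generate(task)
    for _ in range(max_refinements):
        score = rate(task, draft)
        if score >= threshold:
            break
        draft = refine(task, draft, score)
    return draft
```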
| celltalk wrote:
| One of my doctoral propositions is that dialogue leads to true
| artificial intelligence.
| badmonster wrote:
| Have you experimented with weighting the self-evaluations based
| on specific criteria (e.g., correctness, clarity, creativity), or
| using external validators to guide the AI's final choice? Curious
| how much tuning the evaluation step impacts overall performance.
| j45 wrote:
| There appears to be no shortage of token-saving attempts that
| can end up using more tokens, whether on a monthly paid plan or
| the API.
|
| Having an approach to recognize what is needed from the AI
| software, and to anticipate how it may default to respond based
| on its programming, is critical.
| yieldcrv wrote:
| Reminds me of baby agi from 2 years ago
|
| but I guess that was before chain of thought models
| DyslexicAtheist wrote:
| > _"I made my AI think"_ ...
|
| utterly moronic.
|
| They don't "think" ... not even in the most autistic sense of the
| word.
|
| They can generate solutions by combining existing knowledge in
| unique ways. But they don't "think".
| bilekas wrote:
| This is an interesting approach; it reminds me of a YT creator,
| actually. I'll find the channel, but basically he would write a
| script that would play a game, like a race course, with the
| goal being the finish line, and iterate it N times; the script
| would keep iterating until it found the fastest solution.
|
| I believe they called that machine learning... or reinforcement
| learning.
|
| I'm being slightly facetious, but my ignorant understanding of AI
| these days is basically the same no ?
|
| https://www.youtube.com/watch?v=SX08NT55YhA
| cwillu wrote:
| Any api that lets you constrain output to a formal syntax should
| let you do away with the "first output a number, and only then
| explain yourself" boilerplate.
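|
| Without grammar-constrained decoding, a cheap fallback is to ask
| for JSON and validate-with-retry. A sketch assuming only a
| text-in/text-out `ask` function (APIs with real schema
| constraints make the retry loop unnecessary):

```python
import json

# Ask for a {"score", "explanation"} object and retry on any
# malformed reply; constrained-output APIs do this enforcement
# server-side instead.
def scored_answer(prompt, ask, retries=3):
    schema_hint = (
        'Reply with JSON: {"score": <int 1-10>, "explanation": <string>}'
    )
    for _ in range(retries):
        raw = ask(f"{prompt}\n{schema_hint}")
        try:
            obj = json.loads(raw)
            return int(obj["score"]), obj["explanation"]
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            continue
    raise ValueError("model never produced valid JSON")
```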
| hu3 wrote:
| Here's a related challenge I'm facing. Maybe someone can help
| me:
|
| I also managed to make AI critique itself and that improved code
| generation a ton.
|
| For a TypeScript backend project that runs with Bun, I tell AI to
| also generate and run unit tests after every code change
| suggested by AI.
|
| How do you solve the risk of AI writing and executing unit
| tests with something like `rm -rf /` and wiping your files?
|
| Docker works but I like to keep things simple.
|
| Deno supports revoking file access but I'd like to keep using
| Bun.
| zactato wrote:
| Either you trust the AI or you don't. If you don't trust it,
| then you need to review what it's writing.
|
| Docker seems like a pretty low complexity way to create an
| isolated environment to run automation.
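|
| The Docker incantation is short. A sketch of a low-ceremony
| sandbox for AI-written Bun tests, assuming the official
| `oven/bun` image (wrapped in Python only for illustration):

```python
import subprocess

# Read-only source mount, no network, capped resources: the
# worst an `rm -rf` can do is trash the throwaway container.
def sandbox_cmd(project_dir):
    return [
        "docker", "run", "--rm",
        "--network", "none",             # no downloads, no exfiltration
        "--memory", "512m", "--cpus", "1",
        "-v", f"{project_dir}:/app:ro",  # code mounted read-only
        "-w", "/app",
        "oven/bun", "bun", "test",
    ]

def run_tests_sandboxed(project_dir):
    return subprocess.run(sandbox_cmd(project_dir),
                          capture_output=True, text=True)
```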
| derwiki wrote:
| Manually approve every terminal command it wants to run instead
| of vibe mode. Tbh I think an rm -rf scenario is exceedingly
| unlikely.
| ivape wrote:
| a) You should only do this in a sandbox
|
| b) You can have the AI run a "firewall" prompt on the final
| output. So your final output should go through a "You are a
| firewall that checks for dangerous terminal commands such as
| <enumerate list of dangerous commands>. If you spot dangerous
| commands, reform the command so that it is not dangerous"
| albertgoeswoof wrote:
| How far is this going to go? Are we going to have a team of AI
| agents that runs a scrum team and meets for stand ups every
| couple of hours?
|
| Are we going to replicate government bureaucracy with agents all
| debating topics all day long to find the best opinion?
| parrit wrote:
| Maybe. Humans form teams for a reason. Yes, there are different
| experiences and points of view in a human (vs. not so much in
| an LLM), but sometimes a different hat is all it takes, e.g.
| code reviewer vs. coder.
| jbellis wrote:
| does it actually make a difference to do M rounds of N vs one
| round of M*N?
| asdfman123 wrote:
| And when I do this people say I'm overanalyzing
| ivape wrote:
| The thing that makes _us_ weird to regular people is what's
| going to make us uniquely positioned to utilize AI. If people
| only _knew_ the level at which I overanalyze and entertain
| weird ideas. I always inject these personality quirks into my
| instructions and get very creative results. In a weird way, I'm
| starting to appreciate just how weird I actually am.
| killerstorm wrote:
| This is similar to Tree-of-Thought with self-evaluation.
| zekenie wrote:
| I feel like itd be cool to try prompts based on an adversarial
| justice system... attorney agents arguing both sides, a judge
| ruling on "the law"--adherence to instructions etc
| ivape wrote:
| That's very easy to do. A prompt I regularly use is a "council"
| system. For example:
|
| "I believe I have been contacted by the supernatural. Here are
| the details <details>. Please form a council of seven people:
| 1) Secular scientist 2) Religious scientist 3) Paranormal
| historian 4) Secular Psychologist 5) Religious psychologist 6)
| Carl Jung 7) Richard Dawkins. The council should all be
| independent and provide their own objective analysis. Please
| have them create a final report and conclusions at the end".
|
| Your council can be anything, a law firm, a jury, a parent
| teacher association, whatever you want, and as you can see, you
| can throw in known people as well. This can all be done with
| one prompt. It's one of my favorite things to do.
| parrit wrote:
| I want to see "Meh" vs. "Holy crap" as a benchmark in a paper
| published by Google. Or more likely I suspect, Andrej.
| hansmayer wrote:
| Right, so... but you do realise it's still just producing
| random output based on how you reconfigured its weights, right?
| Sometimes it will happen to resonate with what you need. But it
| is still neither thinking nor arguing with itself.
| stormfather wrote:
| I made a trading bot that ingested news. The prompt to assess
| impact was to simulate a debate between Charlie Munger and Warren
| Buffett on whether to invest.
| internetter wrote:
| How did it do?
| Svoka wrote:
| Oh. I was just adding "Use the dialectic method on your
| solution" at the end of the prompt... It does make it think
| harder.
| gnarlouse wrote:
| This seems like low hanging fruit; are we seriously supposed to
| believe this is new and novel?
| ashoeafoot wrote:
| Give it reward and punishment evaluations, exploring the noise
| in parallel, with extinction for the non-rewarding answers?
___________________________________________________________________
(page generated 2025-04-29 23:00 UTC)