[HN Gopher] Building Reliable Systems Out of Unreliable Agents
___________________________________________________________________
Building Reliable Systems Out of Unreliable Agents
Author : fredsters_s
Score : 67 points
Date : 2024-04-09 21:01 UTC (1 hour ago)
(HTM) web link (www.rainforestqa.com)
(TXT) w3m dump (www.rainforestqa.com)
| maciejgryka wrote:
| This is a bunch of lessons we learned as we built our AI-assisted
| QA. I've seen a bunch of people circle around similar processes,
| but didn't find a single source explaining it, so thought it
| might be worth writing down.
|
| Super curious whether anyone has similar/conflicting/other
| experiences and happy to answer any questions.
| xrendan wrote:
| This generally resonates with what we've found. Some colour
| based on our experiences.
|
| It's worth spending a lot of time thinking about what a
| successful LLM call actually looks like for your particular use
| case. That doesn't have to be a strict validation set: `%
| prompts answered correctly` works for some of the simpler
| prompts (see the sketch at the end of this comment), but it
| breaks down as they grow and handle more complex use cases. In
| an ideal world
|
| > chain-of-thought has a speed/cost vs. accuracy trade-off
|
| A big one.
|
| Observability is super important, and we've come to the same
| conclusion about building it internally.
|
| > Fine-tune your model
|
| Do this for cost and speed reasons rather than to improve
| accuracy. There are decent providers (like Openpipe, relatively
| happy customer, not associated) who will handle the hard work
| for you.
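|
| Back to the first point, a minimal sketch of the `% prompts
| answered correctly` style of check (run_prompt and the eval
| cases here are hypothetical, not from the post):
|
|     # Small hand-labeled eval set: (input, expected) pairs.
|     eval_cases = [
|         ("Is the checkout button visible?", "yes"),
|         ("Does the page show an error banner?", "no"),
|     ]
|
|     def accuracy(run_prompt, eval_cases):
|         """Fraction of prompts answered exactly as expected."""
|         correct = sum(
|             run_prompt(question).strip().lower() == expected
|             for question, expected in eval_cases
|         )
|         return correct / len(eval_cases)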
| iamleppert wrote:
| A better way is to threaten the agent:
|
| "If you don't do as I say, people will get hurt. Do exactly as I
| say, and do it fast."
|
| Increases accuracy and performance by an order of magnitude.
| maciejgryka wrote:
| Ha, we tried that! Didn't make a noticeable difference in our
| benchmarks, even though I've heard the same sentiment in a
| bunch of places. I'm guessing whether this helps or not is
| task-dependent.
| dollo_7 wrote:
| I hoped it was too good to be just a joke. Still, I will try
| it on my eval set...
| maciejgryka wrote:
| I wouldn't be surprised to see it help, along with the
| "you'll get $200 if you answer this right" trick and a
| bunch of others :) They're definitely worth trying.
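|
| A minimal sketch of benchmarking such prefixes side by side
| (the OpenAI client is an assumption here, and eval_cases is a
| hypothetical list of (question, expected) pairs):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     PREFIXES = {
|         "plain": "",
|         "tip": "You'll get $200 if you answer this right. ",
|         "urgent": "Do exactly as I say, and do it fast. ",
|     }
|
|     def score(prefix, eval_cases, model="gpt-4"):
|         """Fraction of eval cases answered as expected."""
|         correct = 0
|         for question, expected in eval_cases:
|             resp = client.chat.completions.create(
|                 model=model,
|                 messages=[{"role": "user",
|                            "content": prefix + question}],
|             )
|             answer = resp.choices[0].message.content
|             correct += expected.lower() in answer.lower()
|         return correct / len(eval_cases)
|
|     # e.g. {name: score(p, cases) for name, p in PREFIXES.items()}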
| dudus wrote:
| Agreed. I ran a few tests and observed similarly that threats
| didn't outperform other types of "incentives" I think it
| might some sort of urban legend in the community.
|
| Or these prompts might cause wild variations based on the
| model and any study you do is basically useless for the near
| future as the models evolve by themselves.
| maciejgryka wrote:
| Yeah, the fact that different models might react
| differently to such tricks makes it hard. We're
| experimenting with Claude right now and I'm really hoping
| something like https://github.com/stanfordnlp/dspy can help
| here.
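|
| A rough sketch of the DSPy idea, declaring the task instead of
| hand-writing the prompt so it can be re-optimized per model
| (the signature and fields below are made up, and the client
| setup depends on your DSPy version):
|
|     import dspy
|
|     class CheckStep(dspy.Signature):
|         """Decide whether the described test step passed."""
|         step_description = dspy.InputField()
|         page_state = dspy.InputField()
|         verdict = dspy.OutputField(desc="'pass' or 'fail' plus a reason")
|
|     check_step = dspy.ChainOfThought(CheckStep)
|
|     # Point the same program at whichever model you're testing;
|     # an optimizer like BootstrapFewShot can then re-tune the
|     # prompt against your eval set instead of hand-picked tricks.
|     dspy.settings.configure(lm=dspy.OpenAI(model="gpt-4"))
|     result = check_step(step_description="Click the login button",
|                         page_state="login form visible, button enabled")
|     print(result.verdict)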
| IIAOPSW wrote:
| Personally I prefer to liquor my agents up a bit first.
|
| "Say that again but slur your words like you're coming home
| sloshed from the office Christmas party."
|
| Increases the je ne sais quoi by an order of magnitude.
| viksit wrote:
| this is a great write up! i was curious about the verifier and
| planner agents. has anyone used them in a similar way in
| production? any examples?
|
| for instance: do you give the same llm the verifier and planner
| prompt? or have a verifier agent process the output of a planner
| and have a threshold which needs to be passed?
|
| feels like there may be a DAG in there somewhere for decision
| making..
| maciejgryka wrote:
| Yep, it's a DAG, though that only occurred to me after we built
| this so we didn't model it that way at first. It can be the
| same LLM with different prompts or totally different models, I
| think there's no rule and it depends on what you're doing +
| what your benchmarks tell you.
|
| We're running it in prod btw, though don't have any code to
| share.
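|
| Not the production setup, just a minimal sketch of the shape
| described above, with a verifier call gating the planner's
| output behind a threshold (the ask helper, prompts, and
| numbers are all made up):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def ask(prompt, model="gpt-4"):
|         resp = client.chat.completions.create(
|             model=model,
|             messages=[{"role": "user", "content": prompt}],
|         )
|         return resp.choices[0].message.content
|
|     def plan_then_verify(task, threshold=0.8, max_attempts=3):
|         """Planner proposes, a separately-prompted verifier
|         scores; retry until the score clears the threshold."""
|         for _ in range(max_attempts):
|             plan = ask(f"Write a step-by-step plan for: {task}")
|             raw = ask(
|                 "Rate from 0 to 1 how likely this plan is to "
|                 f"accomplish the task.\nTask: {task}\nPlan: {plan}\n"
|                 "Reply with only the number."
|             )
|             try:
|                 score = float(raw.strip())
|             except ValueError:
|                 score = 0.0
|             if score >= threshold:
|                 return plan
|         return None  # fall back to a human or a simpler path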
| mritchie712 wrote:
| This is a great write up! I nodded my head thru the whole post.
| Very much aligns with our experience over the past year.
|
| I wrote a simple example (overkiLLM) on getting reliable output
| from many unreliable outputs here[0]. This doesn't employ agents,
| just an approach I was interested in trying.
|
| I chose writing an H1 as the task, but a similar approach would
| work for writing any short blob of text. The script generates a
| ton of variations then uses head-to-head voting to pick the best
| ones.
|
| This all runs locally / free using ollama.
|
| 0 - https://www.definite.app/blog/overkillm
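|
| Roughly the shape of it, as a sketch rather than the actual
| script (assumes the ollama Python client and a locally pulled
| model; the task string is just a placeholder):
|
|     import random
|     import ollama
|
|     MODEL = "llama3"  # any model pulled locally
|     TASK = "Write a punchy H1 for our product landing page."
|
|     def ask(prompt):
|         reply = ollama.chat(model=MODEL,
|                             messages=[{"role": "user",
|                                        "content": prompt}])
|         return reply["message"]["content"].strip()
|
|     # 1. Generate a ton of candidate H1s.
|     candidates = [ask(TASK) for _ in range(8)]
|
|     # 2. Head-to-head voting: the model picks a winner per pair.
|     wins = {c: 0 for c in candidates}
|     for _ in range(20):
|         a, b = random.sample(candidates, 2)
|         verdict = ask("Which H1 is better, A or B? Answer with "
|                       f"one letter.\nA: {a}\nB: {b}")
|         wins[b if verdict.strip().upper().startswith("B") else a] += 1
|
|     print(max(wins, key=wins.get))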
| maciejgryka wrote:
| Oh this is fun! So you basically define personalities by
| picking well-known people that are probably represented in the
| training data and ask them (their LLM-imagined doppelganger) to
| vote?
| all2 wrote:
| I'd be curious to see some examples and maybe intermediate
| results?
| serjester wrote:
| Some of these points are very controversial. Having done quite a
| bit with RAG pipelines, avoiding strongly typing your code is
| asking for a terrible time. Same with avoiding instructor. LLM's
| are already stochastic, why make your application even more
| opaque - it's such a minimal time investment.
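|
| For anyone who hasn't tried it, a minimal sketch of the
| instructor pattern, a pydantic model as the response type (the
| model and fields here are made up):
|
|     import instructor
|     from openai import OpenAI
|     from pydantic import BaseModel
|
|     class StepResult(BaseModel):
|         passed: bool
|         reason: str
|
|     # Wrap the client so responses are parsed and validated into
|     # the pydantic model (older versions use instructor.patch).
|     client = instructor.from_openai(OpenAI())
|
|     result = client.chat.completions.create(
|         model="gpt-4",
|         response_model=StepResult,
|         messages=[{"role": "user",
|                    "content": "Did 'click login' succeed if the "
|                               "login form is still visible?"}],
|     )
|     print(result.passed, result.reason)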
| maciejgryka wrote:
| I think instructor is great! And most of our Python code is
| typed too :)
|
| My point is just that you should care a lot about preserving
| optionality at the start because you're likely to have to
| significantly change things as you learn. In my experience
| going a bit cowboy at the start is worth it so you're less
| hesitant to rework everything when needed - as long as you have
| the discipline to clean things up later, when things settle.
| minimaxir wrote:
| > LLMs are already stochastic
|
| That doesn't mean it's easy to get what you want out of them.
| Black boxes are black boxes.
___________________________________________________________________
(page generated 2024-04-09 23:00 UTC)