[HN Gopher] Building Reliable Systems Out of Unreliable Agents
       ___________________________________________________________________
        
       Building Reliable Systems Out of Unreliable Agents
        
       Author : fredsters_s
       Score  : 67 points
       Date   : 2024-04-09 21:01 UTC (1 hour ago)
        
 (HTM) web link (www.rainforestqa.com)
 (TXT) w3m dump (www.rainforestqa.com)
        
       | maciejgryka wrote:
       | This is a bunch of lessons we learned as we built our AI-assisted
       | QA. I've seen a bunch of people circle around similar processes,
       | but didn't find a single source explaining it, so thought it
       | might be worth writing down.
       | 
       | Super curious whether anyone has similar/conflicting/other
       | experiences and happy to answer any questions.
        
         | xrendan wrote:
         | This generally resonates with what we've found. Some colour
         | based on our experiences.
         | 
          | It's worth spending a lot of time thinking about what a
          | successful LLM call actually looks like for your particular use
          | case. That doesn't have to be a strict validation set; `% of
          | prompts answered correctly` is good for some of the simpler
          | prompts (rough sketch at the end of this comment), but as they
          | grow and handle more complex use cases, that breaks down. In an
          | ideal world...
         | 
          | > chain-of-thought has a speed/cost vs. accuracy trade-off
          | 
          | A big one.
         | 
         | Observability is super important and we've come to the same
         | conclusion of building that internally.
         | 
         | > Fine-tune your model
         | 
          | Do this for cost and speed reasons rather than to improve
          | accuracy. There are decent providers (like Openpipe; we're a
          | relatively happy customer, not affiliated) who will handle the
          | hard work for you.
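          | 
          | For the simple end of that spectrum, a rough sketch of the
          | kind of check we mean (call_llm and the exact-match test are
          | placeholders, not our actual harness):
          | 
          |   def passes(prompt, expected, call_llm):
          |       # call_llm wraps whatever model call you make
          |       return call_llm(prompt).strip() == expected
          | 
          |   def accuracy(cases, call_llm):
          |       # cases: list of (prompt, expected_answer) pairs
          |       hits = sum(passes(p, e, call_llm) for p, e in cases)
          |       return hits / len(cases)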
        
       | iamleppert wrote:
       | A better way is to threaten the agent:
       | 
       | "If you don't do as I say, people will get hurt. Do exactly as I
       | say, and do it fast."
       | 
       | Increases accuracy and performance by an order of magnitude.
        
         | maciejgryka wrote:
         | Ha, we tried that! Didn't make a noticeable difference in our
         | benchmarks, even though I've heard the same sentiment in a
         | bunch of places. I'm guessing whether this helps or not is
         | task-dependent.
        
            | dollo_7 wrote:
            | I was hoping it was more than just a joke. Still, I will try
            | it on my eval set...
        
             | maciejgryka wrote:
             | I wouldn't be surprised to see it help, along with the
             | "you'll get $200 if you answer this right" trick and a
             | bunch of others :) They're definitely worth trying.
        
           | dudus wrote:
            | Agreed. I ran a few tests and similarly observed that threats
            | didn't outperform other types of "incentives". I think it
            | might be some sort of urban legend in the community.
            | 
            | Or these prompts might cause wild variations depending on the
            | model, and any study you do is basically useless for the near
            | future as the models themselves evolve.
        
             | maciejgryka wrote:
             | Yeah, the fact that different models might react
             | differently to such tricks makes it hard. We're
             | experimenting with Claude right now and I'm really hoping
             | something like https://github.com/stanfordnlp/dspy can help
             | here.
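              | 
              | The shape I'm hoping for is roughly this (from my
              | reading of the dspy docs, so the exact API may be
              | off):
              | 
              |   import dspy
              |   from dspy.teleprompt import BootstrapFewShot
              | 
              |   lm = dspy.OpenAI(model="gpt-4")
              |   dspy.settings.configure(lm=lm)
              | 
              |   # declare the task instead of hand-writing
              |   # the prompt...
              |   qa = dspy.ChainOfThought("question -> answer")
              | 
              |   # ...and let the optimizer tune it against a
              |   # small trainset + metric, per model
              |   trainset = [
              |       dspy.Example(question="2+2?", answer="4")
              |       .with_inputs("question"),
              |   ]
              | 
              |   def exact_match(example, pred, trace=None):
              |       return example.answer == pred.answer
              | 
              |   optimizer = BootstrapFewShot(metric=exact_match)
              |   tuned = optimizer.compile(qa, trainset=trainset)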
        
         | IIAOPSW wrote:
         | Personally I prefer to liquor my agents up a bit first.
         | 
         | "Say that again but slur your words like you're coming home
         | sloshed from the office Christmas party."
         | 
          | Increases the je ne sais quoi by an order of magnitude.
        
       | viksit wrote:
        | This is a great write-up! I was curious about the verifier and
        | planner agents. Has anyone used them in a similar way in
        | production? Any examples?
        | 
        | For instance: do you give the same LLM the verifier and planner
        | prompts? Or have a verifier agent process the output of a
        | planner, with a threshold which needs to be passed?
        | 
        | Feels like there may be a DAG in there somewhere for decision
        | making...
        
         | maciejgryka wrote:
          | Yep, it's a DAG, though that only occurred to me after we built
          | this, so we didn't model it that way at first. It can be the
          | same LLM with different prompts or totally different models; I
          | think there's no rule and it depends on what you're doing +
          | what your benchmarks tell you.
         | 
         | We're running it in prod btw, though don't have any code to
         | share.
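          | 
          | If it helps, the rough shape (heavily simplified, made-up
          | names, not our production code) looks something like this;
          | the PASS/FAIL check can just as well be a numeric score
          | compared against a threshold:
          | 
          |   PLANNER = "Write a step-by-step plan for: {task}"
          |   VERIFIER = ("Task: {task}\nResult: {result}\n"
          |               "Reply PASS or FAIL.")
          | 
          |   class GiveUp(Exception):
          |       pass
          | 
          |   def run(task, llm, execute, max_retries=3):
          |       # llm: str -> str; execute: plan -> result.
          |       # Planner and verifier can be the same model with
          |       # different prompts, or two different models.
          |       plan = llm(PLANNER.format(task=task))
          |       for _ in range(max_retries):
          |           result = execute(plan)
          |           verdict = llm(VERIFIER.format(task=task,
          |                                         result=result))
          |           if verdict.strip().upper().startswith("PASS"):
          |               return result
          |           # feed the failure back in and re-plan
          |           fail = f"{task}\nLast attempt failed: {result}"
          |           plan = llm(PLANNER.format(task=fail))
          |       raise GiveUp(task)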
        
       | mritchie712 wrote:
        | This is a great write-up! I nodded my head thru the whole post.
       | Very much aligns with our experience over the past year.
       | 
        | I wrote a simple example (overkiLLM) of getting reliable output
        | from many unreliable outputs here[0]. This doesn't employ agents,
        | just an approach I was interested in trying.
        | 
        | I chose writing an H1 as the task, but a similar approach would
        | work for writing any short blob of text. The script generates a
        | ton of variations, then uses head-to-head voting to pick the
        | best ones (rough sketch below).
       | 
       | This all runs locally / free using ollama.
       | 
       | 0 - https://www.definite.app/blog/overkillm
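        | 
        | The gist, heavily simplified (made-up voter names; the post has
        | the real script):
        | 
        |   import random
        |   import ollama
        | 
        |   VOTERS = ["Ernest Hemingway", "David Ogilvy"]
        | 
        |   def gen(prompt, model="llama2"):
        |       return ollama.generate(model=model,
        |                              prompt=prompt)["response"]
        | 
        |   def best_h1(product, n=16, rounds=50):
        |       # generate a ton of candidate headlines...
        |       ask = f"Write one H1 headline for: {product}"
        |       candidates = [gen(ask) for _ in range(n)]
        |       wins = {c: 0 for c in candidates}
        |       # ...then run head-to-head votes to pick the best
        |       for _ in range(rounds):
        |           a, b = random.sample(candidates, 2)
        |           who = random.choice(VOTERS)
        |           vote = gen(f"You are {who}. Which headline is"
        |                      f" better?\nA: {a}\nB: {b}\n"
        |                      "Answer only A or B.")
        |           pick_a = vote.strip().upper().startswith("A")
        |           wins[a if pick_a else b] += 1
        |       return max(wins, key=wins.get)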
        
         | maciejgryka wrote:
         | Oh this is fun! So you basically define personalities by
         | picking well-known people that are probably represented in the
         | training data and ask them (their LLM-imagined doppelganger) to
         | vote?
        
         | all2 wrote:
         | I'd be curious to see some examples and maybe intermediate
         | results?
        
       | serjester wrote:
        | Some of these points are very controversial. Having done quite a
        | bit with RAG pipelines, I'd say avoiding strongly typing your
        | code is asking for a terrible time. Same with avoiding
        | instructor. LLMs are already stochastic, so why make your
        | application even more opaque? It's such a minimal time
        | investment.
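        | 
        | For anyone who hasn't used it, the pattern is roughly this
        | (sketch from memory - check the instructor docs for the exact
        | current API):
        | 
        |   import instructor
        |   from openai import OpenAI
        |   from pydantic import BaseModel
        | 
        |   class TestStep(BaseModel):
        |       action: str
        |       expected_result: str
        | 
        |   client = instructor.from_openai(OpenAI())
        | 
        |   step = client.chat.completions.create(
        |       model="gpt-4",
        |       response_model=TestStep,
        |       messages=[{"role": "user",
        |                  "content": "First step to test login?"}],
        |   )
        |   # step is a validated TestStep instance; instructor
        |   # retries/raises if the output doesn't parse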
        
         | maciejgryka wrote:
         | I think instructor is great! And most of our Python code is
         | typed too :)
         | 
         | My point is just that you should care a lot about preserving
         | optionality at the start because you're likely to have to
         | significantly change things as you learn. In my experience
         | going a bit cowboy at the start is worth it so you're less
         | hesitant to rework everything when needed - as long as you have
         | the discipline to clean things up later, when things settle.
        
         | minimaxir wrote:
          | > LLMs are already stochastic
         | 
         | That doesn't mean it's easy to get what you want out of them.
         | Black boxes are black boxes.
        
       ___________________________________________________________________
       (page generated 2024-04-09 23:00 UTC)