[HN Gopher] About AI Evals
       ___________________________________________________________________
        
       About AI Evals
        
       Author : TheIronYuppie
       Score  : 144 points
       Date   : 2025-07-01 02:48 UTC (2 days ago)
        
 (HTM) web link (hamel.dev)
 (TXT) w3m dump (hamel.dev)
        
       | afro88 wrote:
       | Some great info, but I have to disagree with this:
       | 
       | > Q: How much time should I spend on model selection?
       | 
       | > Many developers fixate on model selection as the primary way to
       | improve their LLM applications. Start with error analysis to
       | understand your failure modes before considering model switching.
       | As Hamel noted in office hours, "I suggest not thinking of
       | switching model as the main axes of how to improve your system
       | off the bat without evidence. Does error analysis suggest that
       | your model is the problem?"
       | 
       | If there's a clear jump in evals from one model to the next (ie
       | Gemini 2 to 2.5, or Claude 3.7 to 4) that will level up your
       | system pretty easily. Use the best models you can, if you can
       | afford it.
        
         | lumost wrote:
          | The vast majority of AI startups will fail for reasons other
          | than model costs. If you crack your use case, model costs
          | should fall exponentially.
        
         | softwaredoug wrote:
          | I might disagree, as these models are pretty inscrutable, and
          | behavior on _your specific task_ can be dramatically different
          | on a new/"better" model. Teams would do well to have the right
          | evals to make this decision rather than get surprised.
          | 
          | Also, the "if you can afford it" can be a fairly non-trivial
          | decision.
        
         | smcleod wrote:
          | Yeah, totally agree. I see so many systems perform badly only
          | to find out they're using an older generation model, and simply
          | updating to the current model fixes many of their issues.
        
         | simonw wrote:
          | I think the key part of that advice is the _without evidence_
          | bit:
         | 
         | > I suggest not thinking of switching model as the main axes of
         | how to improve your system off the bat without evidence.
         | 
          | If you try to fix problems by switching from e.g. Gemini 2.5
          | Flash to OpenAI o3, but you don't have any evals in place, how
          | will you tell if the model switch actually helped?
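          | 
          | A minimal sketch of what I mean by having evals in place before
          | and after a switch (call_model and the cases below are
          | placeholders, not anyone's real harness):
          | 
          |     # Score two models on the same labeled cases so "did the
          |     # switch help?" becomes a number instead of a vibe.
          |     CASES = [
          |         {"prompt": "Summarize our refund policy ...",
          |          "must_contain": "14 days"},
          |         # ... real cases, ideally drawn from error analysis
          |     ]
          | 
          |     def call_model(model: str, prompt: str) -> str:
          |         # placeholder: swap in your provider SDK call here
          |         return "Refunds are accepted within 14 days."
          | 
          |     def pass_rate(model: str) -> float:
          |         passed = sum(
          |             c["must_contain"].lower()
          |             in call_model(model, c["prompt"]).lower()
          |             for c in CASES
          |         )
          |         return passed / len(CASES)
          | 
          |     for model in ("gemini-2.5-flash", "o3"):
          |         print(model, pass_rate(model))
          | 
          | Even a crude pass/fail check like that beats eyeballing a few
          | chats after the switch.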
        
         | phillipcarter wrote:
         | > If there's a clear jump in evals from one model to the next
         | (ie Gemini 2 to 2.5, or Claude 3.7 to 4)
         | 
          | How do you know that _their_ evals match behavior in your
          | application? What if the older, "worse" model actually does
          | some things better, but you don't have comprehensive enough
          | evals for your own domain, so you simply don't know to check
          | the things it's good at?
         | 
         | FWIW I agree that in general, you should start with the most
         | powerful model you can afford, and use that to bootstrap your
         | evals. But I do not think you can rely on generic benchmarks
         | and evals as a proxy for your own domain. I've run into this
         | several times where an ostensibly better model does no better
         | than the previous generation.
        
         | shrumm wrote:
         | The 'with evidence' part is key as simonw said. One anecdote
         | from evals at Cleric - it's rare to see a new model do better
         | on our evals vs the current one. The reality is that you'll
         | optimize prompts etc for the current model.
         | 
         | Instead, if a new model only does marginally worse - that's a
         | strong signal that the new model is indeed better for our use
         | case.
        
         | ndr wrote:
         | Quality can drop drastically even moving from Model N to N+1
         | from the same provider, let alone a different one.
         | 
         | You'll have to adjust a bunch of prompts and measure. And if
         | you didn't have a baseline to begin with good luck YOLOing your
         | way out of it.
        
       | calebkaiser wrote:
       | I'm biased in that I work on an open source project in this
       | space, but I would strongly recommend starting with a free/open
       | source platform for debugging/tracing, annotating, and building
       | custom evals.
       | 
       | This niche of the field has come a very long way just over the
       | last 12 months, and the tooling is so much better than it used to
       | be. Trying to do this from scratch, beyond a "kinda sorta good
       | enough for now" project, is a full-time engineering project in
       | and of itself.
       | 
       | I'm a maintainer of Opik, but you have plenty of options in the
       | space these days for whatever your particular needs are:
       | https://github.com/comet-ml/opik
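        | 
        | As a taste of what the tracing side gives you, here is a minimal
        | sketch using Opik's track decorator (pip install opik;
        | simplified, see the docs for the real setup):
        | 
        |     from opik import track
        | 
        |     @track
        |     def retrieve(query: str) -> list[str]:
        |         return ["...retrieved chunks..."]  # your retriever here
        | 
        |     @track
        |     def answer(query: str) -> str:
        |         context = retrieve(query)
        |         # call your LLM here; nested calls show up as one trace
        |         return f"Answer based on {len(context)} chunks"
        | 
        |     answer("How do I rotate my API key?")
        | 
        | Once traces like these are flowing, annotation and custom evals
        | sit on top of them rather than on ad hoc logging.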
        
         | mbanerjeepalmer wrote:
         | Yes I'm not sure I really want to vibe code something that does
         | auto evals on a sample of my OTEL traces any more than I want
         | to build my own analytics library.
         | 
         | Alternatives to Opik include Braintrust (closed), Promptfoo
         | (open, https://github.com/promptfoo/promptfoo) and Laminar
         | (open, https://github.com/lmnr-ai/lmnr).
        
           | Onawa wrote:
           | I've used and liked Promptfoo a lot, but I've run into issues
           | when trying to do evaluations with too many independent
           | variables. Works great for `models * prompts * variables`,
           | but broke down when we wanted `models * prompts *
           | variables^x`.
        
       | andybak wrote:
       | > About AI Evals
       | 
       | Maybe it's obvious to some - but I was hoping that page started
       | off by explaining what the hell an AI Eval specifically is.
       | 
       | I can probably guess from context but I'd love to have some
       | validation.
        
         | phren0logy wrote:
         | Here's another article by the same author with more background
         | on AI Evals: https://hamel.dev/blog/posts/evals/
         | 
         | I've appreciated Hamel's thinking on this topic.
        
           | xpe wrote:
           | From that article:
           | 
           | > On a related note, unlike traditional unit tests, you don't
           | necessarily need a 100% pass rate. Your pass rate is a
           | product decision, depending on the failures you are willing
           | to tolerate.
           | 
           | Not sure how I feel about this, given expectations, culture,
           | and tooling around CI. This suggestion seems to blur the line
           | between a score from an eval and the usual idea of a unit
           | test.
           | 
           | P.S. It is also useful to track regressions on a per-test
           | basis.
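            | 
            | In pytest terms, the "pass rate is a product decision" idea
            | might look like this (grade_case is a stand-in for whatever
            | eval or LLM judge you actually run):
            | 
            |     import json
            | 
            |     THRESHOLD = 0.90  # product decision: tolerate 10% fails
            | 
            |     CASES = [
            |         {"id": "refund-1", "expected": "14 days"},
            |         {"id": "refund-2", "expected": "store credit"},
            |     ]
            | 
            |     def grade_case(case: dict) -> bool:
            |         # stand-in grader; compare model output to expected
            |         return True
            | 
            |     def test_pass_rate_meets_threshold():
            |         results = {c["id"]: grade_case(c) for c in CASES}
            |         # keep per-case results so regressions can still be
            |         # tracked test by test
            |         with open("eval_results.json", "w") as f:
            |             json.dump(results, f)
            |         rate = sum(results.values()) / len(results)
            |         assert rate >= THRESHOLD, f"pass rate {rate:.2%}"
            | 
            | That keeps the CI gate on an aggregate score while the per-
            | case file preserves the unit-test-style view.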
        
       | davedx wrote:
        | I've worked with LLMs for the better part of the last couple of
        | years, including on evals, but I still don't understand a lot of
        | what's being suggested. What exactly is a "custom annotation
        | tool", and for annotating what?
        
         | calebkaiser wrote:
         | Typically, you would collect a ton of execution traces from
         | your production app. Annotating them can mean a lot of
         | different things, but often it means some mixture of automated
         | scoring and manual review. At the earliest stages, you're
         | usually annotating common modes of failure, so you can say like
         | "In 30% of failures, the retrieval component of our RAG app is
         | grabbing irrelevant context." or "In 15% of cases, our chat
         | agent misunderstood the user's query and did not ask
          | clarifying questions."
         | 
         | You can then create datasets out of these traces, and use them
         | to benchmark improvements you make to your application.
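          | 
          | In code terms the loop looks roughly like this (the traces and
          | failure-mode labels below are made up for illustration):
          | 
          |     from collections import Counter
          | 
          |     traces = [
          |         {"id": 1, "failure": "irrelevant_context"},
          |         {"id": 2, "failure": None},
          |         {"id": 3, "failure": "missed_clarifying_question"},
          |     ]
          | 
          |     failures = [t for t in traces if t["failure"]]
          |     counts = Counter(t["failure"] for t in failures)
          |     for mode, n in counts.most_common():
          |         print(f"{mode}: {n / len(failures):.0%} of failures")
          | 
          |     # one failure mode becomes a benchmark dataset for the
          |     # next iteration of the retrieval component
          |     retrieval_set = [t for t in traces
          |                      if t["failure"] == "irrelevant_context"]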
        
         | spmurrayzzz wrote:
         | Concrete example from my own workflows: in my IDE whenever I
         | accept or reject a FIM completion, I capture that data (the
         | prefix, the suffix, the completion, and the thumbs up/down
         | signal) and put it in a database. The resultant dataset is
         | annotated such that I can use it for analysis, debugging,
         | finetuning, prompt mgmt, etc. The "custom" tooling part in this
         | case would be that I'm using a forked version of Zed that I've
         | customized in part for this purpose.
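          | 
          | The capture side can be tiny; a stripped-down sketch with
          | sqlite3 (the schema and hook point here are illustrative, not
          | Zed's actual internals):
          | 
          |     import sqlite3, time
          | 
          |     db = sqlite3.connect("fim_feedback.db")
          |     db.execute("""CREATE TABLE IF NOT EXISTS completions (
          |         ts REAL, prefix TEXT, suffix TEXT,
          |         completion TEXT, accepted INTEGER)""")
          | 
          |     def record(prefix, suffix, completion, accepted):
          |         db.execute(
          |             "INSERT INTO completions VALUES (?, ?, ?, ?, ?)",
          |             (time.time(), prefix, suffix, completion,
          |              int(accepted)),
          |         )
          |         db.commit()
          | 
          | From there, rejected rows (accepted = 0) feed error analysis
          | and accepted rows can feed finetuning, as described above.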
        
       | pamelafox wrote:
       | Fantastic FAQ, thank you Hamel for writing it up. We had an open
       | space on AI Evals at Pycon this year, and had lots of discussion
       | around similar questions. I only wrote down the questions,
       | however:
       | 
        | ## Evaluation Metrics & Methodology
       | 
       | * What metrics do you use (e.g., BERTScore, ROUGE, F1)? Are
       | similarity metrics still useful?
       | 
       | * Do you use step-by-step evaluations or evaluate full responses?
       | 
       | * How do you evaluate VLM (vision-language model) summarization?
       | Do you sample outputs or extract named entities?
       | 
       | * How do you approach offline (ground truth) vs. online
       | evaluation?
       | 
       | * How do you handle uncertainty or "don't know" cases?
       | (Temperature settings?)
       | 
       | * How do you evaluate multi-turn conversations?
       | 
       | * A/B comparisons and discrete labels (e.g., good/bad) are easier
       | to interpret.
       | 
       | * It's important to counteract bias toward your own favorite eval
       | questions--ensure a diverse dataset.
       | 
       | ## Prompting & Models
       | 
       | * Do you modify prompts based on the specific app being
       | evaluated?
       | 
       | * Where do you store prompts--text files, Prompty, database, or
       | in code?
       | 
       | * Do you have domain experts edit or review prompts?
       | 
       | * How do you choose which model to use?
       | 
       | ## Evaluation Infrastructure
       | 
       | * How do you choose an evaluation framework?
       | 
       | * What platforms do you use to gather domain expert feedback or
       | labels?
       | 
       | * Do domain experts label outputs or also help with prompt
       | design?
       | 
       | ## User Feedback & Observability
       | 
       | * Do you collect thumbs up / thumbs down feedback?
       | 
       | * How does observability help identify failure modes?
       | 
       | * Do models tend to favor their own outputs? (There's research on
       | this.)
       | 
       | I personally work on adding evaluation to our most popular Azure
       | RAG samples, and put a Textual CLI interface in this repo that
       | I've found helpful for reviewing the eval results:
       | https://github.com/Azure-Samples/ai-rag-chat-evaluator
        
         | mmanulis wrote:
         | Any chance you can share what the answers were for choosing an
         | evaluation framework?
        
         | hamelsmu wrote:
         | This is Hamel. Thanks for sharing! I will incorporate these
         | into the FAQ. I love getting additional questions like this.
        
       | satisfice wrote:
       | This reads like a collection of ad hoc advice overfitted to
       | experience that is probably obsolete or will be tomorrow. And we
       | don't even know if it does fit the author's experience.
       | 
       | I am looking for solid evidence of the efficacy of folk theories
       | about how to make AI perform evaluation.
       | 
       | Seems to me a bunch of people are hoping that AI can test AI, and
       | that it can to some degree. But in the end AI cannot be
       | accountable for such testing, and we can never know all the holes
       | in its judgment, nor can we expect that fixing a hole will not
       | tear open other holes.
        
         | simonw wrote:
         | Hamel wrote a whole lot more about the "LLM as a judge" pattern
         | (where you use LLMs to evaluate the output of other LLMs) here:
         | https://hamel.dev/blog/posts/llm-judge/
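          | 
          | The core of the pattern is small. A bare-bones illustration
          | (not Hamel's exact setup; uses the OpenAI SDK, but any provider
          | works):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          | 
          |     def judge(question: str, answer: str) -> bool:
          |         prompt = (
          |             "You are grading an assistant's answer.\n"
          |             f"Question: {question}\nAnswer: {answer}\n"
          |             "Reply with PASS or FAIL and a one-line reason."
          |         )
          |         resp = client.chat.completions.create(
          |             model="gpt-4o",  # example model name
          |             messages=[{"role": "user", "content": prompt}],
          |         )
          |         text = resp.choices[0].message.content.strip()
          |         return text.startswith("PASS")
          | 
          | The hard part Hamel focuses on is everything around that call:
          | aligning the judge with a human reviewer and checking their
          | agreement before trusting it.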
        
           | hamelsmu wrote:
           | Appreciate it, Simon! I have now edited my post to include
           | links to "intro to evals" for those not familiar.
        
       | ReDeiPirati wrote:
       | > Q: What makes a good custom interface for reviewing LLM
       | outputs? Great interfaces make human review fast, clear, and
       | motivating. We recommend building your own annotation tool
       | customized to your domain ...
       | 
        | Ah! This is horrible advice. Why would you recommend reinventing
        | the wheel when there is already great open source software
        | available? Just use
        | https://github.com/HumanSignal/label-studio/ or any other open
        | source annotation software you want to get started. These tools
        | already cover pretty much all the possible use cases, and if
        | they don't, you can just build on top of them instead of
        | building from zero.
        
         | abletonlive wrote:
          | This awful advice can't be blanket applied and misses the
          | point: starting from zero is extremely easy now with LLMs; the
          | last 10% is the hardest part. Not only that, if you don't start
          | from zero you aren't able to build from whatever you think the
          | new first principles are. SpaceX would not exist if it had
          | tried to extend the old paradigm of rocketry.
         | 
         | There's nothing wrong with starting from scratch or rebuilding
         | an existing tool from the ground up. There's no reason to
         | blindly build from the status quo.
        
           | ReDeiPirati wrote:
            | I'd have agreed with you if the principles were different.
            | But what was shown in the post is EXACTLY what those tools
            | are doing today. Actually, those tools are far more
            | powerful, considering and covering many more scenarios.
            | 
            | > There's nothing wrong with starting from scratch or
            | rebuilding an existing tool from the ground up. There's no
            | reason to blindly build from the status quo.
            | 
            | Generally speaking, all the options are OK, but not if you
            | want to have something up as fast as you can or if your team
            | is piloting something. I think the time you'd spend vibe
            | coding it is greater than the time to set any of those tools
            | up.
            | 
            | And BTW, you shouldn't vibe code something that proprietary
            | data flows through. At the very least, you'd be working with
            | copilots.
        
         | jph00 wrote:
         | Label Studio is great, but by trying to cover so many use
         | cases, it becomes pretty complex.
         | 
         | I've found it's often easier to just whip up something for my
         | specific needs, when I need it.
        
         | bbischof wrote:
          | Label Studio is fine if it covers your needs, but in many
          | cases the core opportunity in an eval interface is fitting in
          | with the SME's workflow or current tech stack.
          | 
          | If Label Studio looks like what they can use, it's fine. If
          | not, a day of vibecoding is worth the effort to make your
          | partners with special knowledge comfortable.
        
         | dbish wrote:
          | I think the truth is somewhere in between. I find Label Studio
          | to be lacking a lot of niceties and generally built for the
          | average text labeling or image labeling use case; for anything
          | else (like a multi-step agent workflow or some sort of
          | multi-modal, task-specific problem) it is not quite right, and
          | you do end up building your own custom interface.
          | 
          | So, imho, you should try Label Studio, but timebox it and
          | decide for yourself quickly (within a day) whether it's going
          | to work for you. If not, go vibecode a different view and try
          | it out, or build labeling into a copy of a front end you're
          | already using for your task if that's quick.
         | 
         | What I think we really need here is a "lovable meets
         | labelstudio" that starts with simple defaults and lets anyone
         | use natural language, sketches, screenshots, to create custom
         | interfaces and modify them quickly.
        
           | ultrasaurus wrote:
           | The SaaS version of Label Studio does have a natural language
           | interface to create custom interfaces:
           | https://docs.humansignal.com/guide/ask_ai
           | 
           | I'm ostensibly an expert in the product and I probably use
           | that 90%+ of the time (unless I'm testing something specific)
           | -- using a sketch as input is a cool idea though!
           | 
           | Disclaimer: I'm the VP Product at HumanSignal the company
           | behind Label Studio.
        
       | th0ma5 wrote:
        | People should be demanding consistency and traceability from the
        | model vendors, checked by some tool perhaps like this. This may
        | tell you when the vendor changed something, but there is
        | otherwise no recourse?
        
       | dbish wrote:
       | Hamel has really great practical eval advice and I always share
       | his advice and posts to any new teams developing AI
       | features/agents/assistants that I'm working with, both internally
       | and with new startups in the AI applications space.
       | 
        | What I'd love to see one day is a way to capture this advice in a
        | "Hamel in a box" eval copilot, or the agent that helps eval and
        | improve other AI agents :). An eval expert who can ask the
        | questions he's asking, look at data flowing through your system,
        | make suggestions about how to improve your eval process, and
        | automatically guide non-experts into following good practices
        | for their eval loop.
        
         | hamelsmu wrote:
         | I think that will be very possible soon! We continue to write
         | about it publicly :) Also thanks to my friends and colleagues
         | who write a lot on this subject that I frequently collaborate
         | with:
         | 
          | - Shreya Shankar https://www.sh-reya.com/
          | - Eugene Yan https://eugeneyan.com/
          | - Bryan Bischof https://bio.site/Docdonut
        
       | _jonas wrote:
       | Evals are critical, and I love the practicality of this guide!
       | 
        | One problem not covered here is knowing which data to review.
        | 
        | If your AI system produces, say, 95% accurate responses, your
        | Evals team will spend too much time reviewing mostly-fine
        | production logs before they discover the different AI failure
        | modes.
        | 
        | To let your Evals team spend time reviewing only the high-signal
        | responses that are likely incorrect, I built a tool that
        | automatically surfaces the least trustworthy LLM responses:
       | 
       | https://help.cleanlab.ai/tlm/
       | 
       | Hope you find it useful, I made sure it works out-of-the-box with
       | zero-configuration required!
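        | 
        | The triage idea itself is simple (score_trust below is a
        | placeholder, not the TLM API; see the docs for the real thing):
        | 
        |     def score_trust(prompt: str, response: str) -> float:
        |         # stand-in for TLM or any confidence/uncertainty score
        |         return 0.5
        | 
        |     logs = [
        |         {"prompt": "What is the refund window?",
        |          "response": "14 days"},
        |         {"prompt": "Can I ship to the moon?",
        |          "response": "Yes, free of charge"},
        |     ]
        | 
        |     worst_first = sorted(
        |         logs,
        |         key=lambda e: score_trust(e["prompt"], e["response"]),
        |     )
        |     for entry in worst_first[:50]:  # review queue
        |         print(entry["prompt"], "->", entry["response"])
        | 
        | Reviewers start with the least trustworthy responses instead of
        | sampling production logs uniformly.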
        
         | hamelsmu wrote:
         | Hamel here. Thanks so much for asking this question! I will
         | work on adding it to the FAQ. Please keep these coming!
        
       ___________________________________________________________________
       (page generated 2025-07-03 23:00 UTC)