[HN Gopher] About AI Evals
___________________________________________________________________
About AI Evals
Author : TheIronYuppie
Score : 144 points
Date : 2025-07-01 02:48 UTC (2 days ago)
(HTM) web link (hamel.dev)
(TXT) w3m dump (hamel.dev)
| afro88 wrote:
| Some great info, but I have to disagree with this:
|
| > Q: How much time should I spend on model selection?
|
| > Many developers fixate on model selection as the primary way to
| improve their LLM applications. Start with error analysis to
| understand your failure modes before considering model switching.
| As Hamel noted in office hours, "I suggest not thinking of
| switching model as the main axes of how to improve your system
| off the bat without evidence. Does error analysis suggest that
| your model is the problem?"
|
| If there's a clear jump in evals from one model to the next
| (e.g. Gemini 2 to 2.5, or Claude 3.7 to 4), that will level up
| your system pretty easily. Use the best models you can, if you
| can afford it.
| lumost wrote:
| The vast majority of AI startups will fail for reasons other
| than model costs. If you crack your use case, model costs
| should fall exponentially.
| softwaredoug wrote:
| I might disagree, as these models are pretty inscrutable, and
| behavior on _your specific task_ can be dramatically different
| on a new/"better" model. Teams would do well to have the right
| evals to make this decision rather than get surprised.
|
| Also, the "if you can afford it" can be a fairly non-trivial
| decision.
| smcleod wrote:
| Yeah, totally agree. I see so many systems perform badly, only
| to find out they're using an older-generation model, and simply
| updating to the current model fixes many of their issues.
| simonw wrote:
| I think the key part of that advice is the _without evidence_
| bit:
|
| > I suggest not thinking of switching model as the main axes of
| how to improve your system off the bat without evidence.
|
| If you try to fix problems by switching from, e.g., Gemini 2.5
| Flash to OpenAI o3, but you don't have any evals in place, how
| will you tell if the model switch actually helped?
| phillipcarter wrote:
| > If there's a clear jump in evals from one model to the next
| (e.g. Gemini 2 to 2.5, or Claude 3.7 to 4)
|
| How do you know that _their_ evals match behavior in your
| application? What if the older, "worse" model actually does
| some things better, but if you don't have comprehensive enough
| evals for your own domain, you simply don't know to check the
| things it's good at?
|
| FWIW I agree that in general, you should start with the most
| powerful model you can afford, and use that to bootstrap your
| evals. But I do not think you can rely on generic benchmarks
| and evals as a proxy for your own domain. I've run into this
| several times where an ostensibly better model does no better
| than the previous generation.
| shrumm wrote:
| The 'with evidence' part is key, as simonw said. One anecdote
| from evals at Cleric: it's rare to see a new model do better
| on our evals vs the current one. The reality is that you'll
| have optimized prompts etc. for the current model.
|
| Instead, if a new model only does marginally worse despite
| that handicap, that's a strong signal that the new model is
| indeed better for our use case.
| ndr wrote:
| Quality can drop drastically even moving from Model N to N+1
| from the same provider, let alone a different one.
|
| You'll have to adjust a bunch of prompts and measure. And if
| you didn't have a baseline to begin with, good luck YOLOing
| your way out of it.
| calebkaiser wrote:
| I'm biased in that I work on an open source project in this
| space, but I would strongly recommend starting with a free/open
| source platform for debugging/tracing, annotating, and building
| custom evals.
|
| This niche of the field has come a very long way just over the
| last 12 months, and the tooling is so much better than it used to
| be. Trying to do this from scratch, beyond a "kinda sorta good
| enough for now" project, is a full-time engineering effort in
| and of itself.
|
| I'm a maintainer of Opik, but you have plenty of options in the
| space these days for whatever your particular needs are:
| https://github.com/comet-ml/opik
| mbanerjeepalmer wrote:
| Yes I'm not sure I really want to vibe code something that does
| auto evals on a sample of my OTEL traces any more than I want
| to build my own analytics library.
|
| Alternatives to Opik include Braintrust (closed), Promptfoo
| (open, https://github.com/promptfoo/promptfoo) and Laminar
| (open, https://github.com/lmnr-ai/lmnr).
| Onawa wrote:
| I've used and liked Promptfoo a lot, but I've run into issues
| when trying to do evaluations with too many independent
| variables. Works great for `models * prompts * variables`,
| but broke down when we wanted `models * prompts *
| variables^x`.
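|
| To make the blow-up concrete, here's a rough Python sketch
| (not Promptfoo's config format; the dimensions are made up) of
| what a full cross-product grid implies:
|
|     from itertools import product
|
|     # Hypothetical eval dimensions: each extra independent
|     # variable multiplies the number of runs.
|     models = ["model-a", "model-b"]
|     prompts = ["prompt_v1", "prompt_v2"]
|     variables = {
|         "tone": ["formal", "casual"],
|         "length": ["short", "long"],
|         "language": ["en", "de"],
|     }
|
|     # Every combination of model x prompt x variable values.
|     grid = list(product(models, prompts, *variables.values()))
|     print(len(grid))  # 2*2*2*2*2 = 32 runs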
| andybak wrote:
| > About AI Evals
|
| Maybe it's obvious to some - but I was hoping that page started
| off by explaining what the hell an AI Eval specifically is.
|
| I can probably guess from context but I'd love to have some
| validation.
| phren0logy wrote:
| Here's another article by the same author with more background
| on AI Evals: https://hamel.dev/blog/posts/evals/
|
| I've appreciated Hamel's thinking on this topic.
| xpe wrote:
| From that article:
|
| > On a related note, unlike traditional unit tests, you don't
| necessarily need a 100% pass rate. Your pass rate is a
| product decision, depending on the failures you are willing
| to tolerate.
|
| Not sure how I feel about this, given expectations, culture,
| and tooling around CI. This suggestion seems to blur the line
| between a score from an eval and the usual idea of a unit
| test.
|
| P.S. It is also useful to track regressions on a per-test
| basis.
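|
| As a minimal sketch of how a tolerable pass rate and per-test
| regression tracking can coexist (the names and the 0.9
| threshold are just illustrative):
|
|     def eval_gate(results, baseline, threshold=0.9):
|         """results/baseline: dicts of case_id -> bool."""
|         pass_rate = sum(results.values()) / len(results)
|         regressions = [c for c, passed in results.items()
|                        if not passed and baseline.get(c, False)]
|         # Gate on the product-decided pass rate, but never
|         # allow a previously passing case to regress silently.
|         ok = pass_rate >= threshold and not regressions
|         return ok, pass_rate, regressions
|
|     ok, rate, regs = eval_gate(
|         {"case1": True, "case2": False, "case3": True},
|         {"case1": True, "case2": True, "case3": True},
|     )
|     print(ok, round(rate, 2), regs)  # False 0.67 ['case2']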
| davedx wrote:
| I've worked with LLMs for the better part of the last couple of
| years, including on evals, but I still don't understand a lot of
| what's being suggested. What exactly is a "custom annotation
| tool", for annotating what?
| calebkaiser wrote:
| Typically, you would collect a ton of execution traces from
| your production app. Annotating them can mean a lot of
| different things, but often it means some mixture of automated
| scoring and manual review. At the earliest stages, you're
| usually annotating common modes of failure, so you can say
| something like "In 30% of failures, the retrieval component of
| our RAG app is grabbing irrelevant context" or "In 15% of
| cases, our chat agent misunderstood the user's query and did
| not ask clarifying questions."
|
| You can then create datasets out of these traces, and use them
| to benchmark improvements you make to your application.
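|
| As a very rough sketch of what that can look like in code (the
| field names and failure-mode labels here are made up, not a
| real schema):
|
|     from collections import Counter
|     from dataclasses import dataclass
|
|     @dataclass
|     class Annotation:
|         trace_id: str
|         failure_mode: str | None  # None = trace looked fine
|         note: str = ""
|
|     annotations = [
|         Annotation("t1", "irrelevant_retrieval", "bad chunks"),
|         Annotation("t2", None),
|         Annotation("t3", "no_clarifying_question"),
|         Annotation("t4", "irrelevant_retrieval"),
|     ]
|
|     # Tally failure modes to see where to focus improvements.
|     failures = [a.failure_mode for a in annotations
|                 if a.failure_mode]
|     print(Counter(failures).most_common())
|     # [('irrelevant_retrieval', 2), ('no_clarifying_question', 1)]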
| spmurrayzzz wrote:
| Concrete example from my own workflows: in my IDE whenever I
| accept or reject a FIM completion, I capture that data (the
| prefix, the suffix, the completion, and the thumbs up/down
| signal) and put it in a database. The resultant dataset is
| annotated such that I can use it for analysis, debugging,
| finetuning, prompt mgmt, etc. The "custom" tooling part in this
| case would be that I'm using a forked version of Zed that I've
| customized in part for this purpose.
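|
| Roughly, the capture side boils down to something like this (a
| sketch, not the actual code in my Zed fork; the table and
| column names are arbitrary):
|
|     import sqlite3, time
|
|     conn = sqlite3.connect("fim_feedback.db")
|     conn.execute("""CREATE TABLE IF NOT EXISTS completions (
|         ts REAL, prefix TEXT, suffix TEXT,
|         completion TEXT, accepted INTEGER)""")
|
|     def log_completion(prefix, suffix, completion, accepted):
|         # One row per FIM suggestion, with the thumbs up/down
|         # signal stored as 0/1 for later analysis/finetuning.
|         conn.execute(
|             "INSERT INTO completions VALUES (?, ?, ?, ?, ?)",
|             (time.time(), prefix, suffix, completion,
|              int(accepted)))
|         conn.commit()
|
|     log_completion("def add(a, b):\n    ", "\n",
|                    "return a + b", accepted=True)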
| pamelafox wrote:
| Fantastic FAQ, thank you Hamel for writing it up. We had an open
| space on AI Evals at Pycon this year, and had lots of discussion
| around similar questions. I only wrote down the questions,
| however:
|
| ## Evaluation Metrics & Methodology
|
| * What metrics do you use (e.g., BERTScore, ROUGE, F1)? Are
| similarity metrics still useful?
|
| * Do you use step-by-step evaluations or evaluate full responses?
|
| * How do you evaluate VLM (vision-language model) summarization?
| Do you sample outputs or extract named entities?
|
| * How do you approach offline (ground truth) vs. online
| evaluation?
|
| * How do you handle uncertainty or "don't know" cases?
| (Temperature settings?)
|
| * How do you evaluate multi-turn conversations?
|
| * A/B comparisons and discrete labels (e.g., good/bad) are easier
| to interpret.
|
| * It's important to counteract bias toward your own favorite eval
| questions--ensure a diverse dataset.
|
| ## Prompting & Models
|
| * Do you modify prompts based on the specific app being
| evaluated?
|
| * Where do you store prompts--text files, Prompty, database, or
| in code?
|
| * Do you have domain experts edit or review prompts?
|
| * How do you choose which model to use?
|
| ## Evaluation Infrastructure
|
| * How do you choose an evaluation framework?
|
| * What platforms do you use to gather domain expert feedback or
| labels?
|
| * Do domain experts label outputs or also help with prompt
| design?
|
| ## User Feedback & Observability
|
| * Do you collect thumbs up / thumbs down feedback?
|
| * How does observability help identify failure modes?
|
| * Do models tend to favor their own outputs? (There's research on
| this.)
|
| I personally work on adding evaluation to our most popular Azure
| RAG samples, and put a Textual CLI interface in this repo that
| I've found helpful for reviewing the eval results:
| https://github.com/Azure-Samples/ai-rag-chat-evaluator
| mmanulis wrote:
| Any chance you can share what the answers were for choosing an
| evaluation framework?
| hamelsmu wrote:
| This is Hamel. Thanks for sharing! I will incorporate these
| into the FAQ. I love getting additional questions like this.
| satisfice wrote:
| This reads like a collection of ad hoc advice overfitted to
| experience that is probably obsolete or will be tomorrow. And we
| don't even know if it does fit the author's experience.
|
| I am looking for solid evidence of the efficacy of folk theories
| about how to make AI perform evaluation.
|
| Seems to me a bunch of people are hoping that AI can test AI, and
| that it can to some degree. But in the end AI cannot be
| accountable for such testing, and we can never know all the holes
| in its judgment, nor can we expect that fixing a hole will not
| tear open other holes.
| simonw wrote:
| Hamel wrote a whole lot more about the "LLM as a judge" pattern
| (where you use LLMs to evaluate the output of other LLMs) here:
| https://hamel.dev/blog/posts/llm-judge/
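|
| The core of the pattern is small enough to sketch (this
| assumes the OpenAI Python SDK and an arbitrary judge model,
| not Hamel's exact setup):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def judge(question: str, answer: str) -> bool:
|         # Ask a second model to grade the first model's answer
|         # and force a verdict you can aggregate over a dataset.
|         prompt = (
|             "You are grading an AI assistant's answer.\n"
|             f"Question: {question}\nAnswer: {answer}\n"
|             "Reply with exactly PASS or FAIL based on whether "
|             "the answer is correct and addresses the question."
|         )
|         resp = client.chat.completions.create(
|             model="gpt-4o-mini",  # any capable judge model
|             messages=[{"role": "user", "content": prompt}],
|         )
|         verdict = resp.choices[0].message.content.strip()
|         return verdict.upper().startswith("PASS")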
| hamelsmu wrote:
| Appreciate it, Simon! I have now edited my post to include
| links to "intro to evals" for those not familiar.
| ReDeiPirati wrote:
| > Q: What makes a good custom interface for reviewing LLM
| outputs? Great interfaces make human review fast, clear, and
| motivating. We recommend building your own annotation tool
| customized to your domain ...
|
| Ah! This is horrible advice. Why would you recommend
| reinventing the wheel when there is already great open source
| software available? Just use
| https://github.com/HumanSignal/label-studio/ or any other open
| source annotation tool you want to get started. These tools
| already cover pretty much all the possible use cases, and if
| they don't, you can just build on top of them instead of
| building from zero.
| abletonlive wrote:
| This awful advice can't be blanket-applied and misses the
| point: starting from zero is extremely easy now with LLMs; the
| last 10% is the hardest part. Not only that, if you don't start
| from zero you aren't able to build from whatever you think the
| new first principles are. SpaceX would not exist if it had
| tried to extend the old paradigm of rocketry.
|
| There's nothing wrong with starting from scratch or rebuilding
| an existing tool from the ground up. There's no reason to
| blindly build from the status quo.
| ReDeiPirati wrote:
| I'd have agreed with you if the premise were different. But
| what was shown in the post is EXACTLY what those tools already
| do today. In fact, those tools are far more powerful and cover
| far more scenarios.
|
| > There's nothing wrong with starting from scratch or
| rebuilding an existing tool from the ground up. There's no
| reason to blindly build from the status quo.
|
| Generally speaking, all the options are OK, but not if you
| want to have something up as fast as you can or if your team
| is piloting something. I think the time you'd spend vibe
| coding it is greater than the time to set any of those tools
| up.
|
| And BTW, you shouldn't vibe code something that proprietary
| data flows through. At the very least, work with copilots.
| jph00 wrote:
| Label Studio is great, but by trying to cover so many use
| cases, it becomes pretty complex.
|
| I've found it's often easier to just whip up something for my
| specific needs, when I need it.
| bbischof wrote:
| Label Studio is fine if it covers your need, but in many cases
| the core opportunity in an eval interface is fitting in with
| the SME's workflow or current tech stack.
|
| If Label Studio looks like what they can use, it's fine. If
| not, a day of vibecoding is worth the effort to make your
| partners with special knowledge comfortable.
| dbish wrote:
| I think the truth is somewhere in between. I find Label Studio
| to be lacking a lot of niceties and generally built for the
| average text labeling or image labeling use case; for anything
| else (like a multi-step agent workflow or some sort of
| multi-modal, task-specific problem) it is not quite right, and
| you end up trying to build your own custom interface anyway.
|
| So, IMHO, you should try Label Studio but timebox it: decide
| for yourself quickly, within a day, whether it's going to work
| for you. If not, go vibecode a different view and try it out,
| or build labeling into a copy of a front end you're already
| using for your task, if that's quick.
|
| What I think we really need here is a "Lovable meets Label
| Studio" that starts with simple defaults and lets anyone use
| natural language, sketches, or screenshots to create custom
| interfaces and modify them quickly.
| ultrasaurus wrote:
| The SaaS version of Label Studio does have a natural language
| interface to create custom interfaces:
| https://docs.humansignal.com/guide/ask_ai
|
| I'm ostensibly an expert in the product and I probably use
| that 90%+ of the time (unless I'm testing something specific)
| -- using a sketch as input is a cool idea though!
|
| Disclaimer: I'm the VP Product at HumanSignal the company
| behind Label Studio.
| th0ma5 wrote:
| People should be demanding consistency and traceability from the
| model vendors, checked by some tool, perhaps like this one. This
| may tell you when the vendor changed something, but is there
| otherwise any recourse?
| dbish wrote:
| Hamel has really great practical eval advice, and I always share
| his advice and posts with any new teams developing AI
| features/agents/assistants that I'm working with, both internally
| and with new startups in the AI applications space.
|
| What I'd love to see one day is a way to capture this advice in a
| "Hamel in a box" eval copilot, or the agent that helps eval and
| improve other AI agents :). An eval expert who can ask the
| questions he's asking, look at data flowing through your system,
| make suggestions about how to improve your eval process, and
| automatically guide non-experts into following good practices for
| their eval loop.
| hamelsmu wrote:
| I think that will be very possible soon! We continue to write
| about it publicly :) Also thanks to my friends and colleagues
| who write a lot on this subject and whom I frequently
| collaborate with:
|
| - Shreya Shankar https://www.sh-reya.com/
| - Eugene Yan https://eugeneyan.com/
| - Bryan Bischof https://bio.site/Docdonut
| _jonas wrote:
| Evals are critical, and I love the practicality of this guide!
|
| One problem not covered here is: knowing which data to review.
|
| If your AI system produces, say, 95% accurate responses, your
| Evals team will spend too much time reviewing production logs to
| discover the different AI failure modes, since most of what they
| sample will be fine.
|
| To enable your Evals team to only spend time reviewing the high-
| signal responses that are likely incorrect, I built a tool that
| automatically surfaces the least trustworthy LLM responses:
|
| https://help.cleanlab.ai/tlm/
|
| Hope you find it useful! I made sure it works out of the box,
| with zero configuration required.
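|
| The triage idea itself is simple to sketch (this is a generic
| illustration, not the TLM API; `score_trust` is a hypothetical
| scoring function):
|
|     def prioritize_for_review(logs, score_trust, budget=50):
|         """logs: list of dicts with 'prompt' and 'response'.
|
|         Returns the `budget` least-trustworthy responses, so
|         reviewers spend time where failures are most likely.
|         """
|         scored = [(score_trust(l["prompt"], l["response"]), l)
|                   for l in logs]
|         scored.sort(key=lambda pair: pair[0])  # lowest first
|         return [log for _, log in scored[:budget]]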
| hamelsmu wrote:
| Hamel here. Thanks so much for asking this question! I will
| work on adding it to the FAQ. Please keep these coming!
___________________________________________________________________
(page generated 2025-07-03 23:00 UTC)