[HN Gopher] Evals are not all you need
___________________________________________________________________
Evals are not all you need
Author : amarble
Score : 32 points
Date : 2025-03-03 18:09 UTC (4 hours ago)
(HTM) web link (www.marble.onl)
(TXT) w3m dump (www.marble.onl)
| iLoveOncall wrote:
| There are so many "if you do that then it's not good" caveats in
| this article that it actually seems like evals can in fact be all
| you need, as long as you do them right?
|
| I build an LLM-based platform at work, with a lot of agents and
| data sources, and yet we still don't fall under any of those
| "ifs".
| tcdent wrote:
| This conversation always ends in a ":shrug: I guess we'll never
| know."
|
| There may never be a silver-bullet approach to this, or anything
| that satisfies our need for determinism the way unit testing does,
| but we can still try.
|
| Would love to see as much open-source framework effort put into
| this as is currently being put into agentic workflows.
| intellectronica wrote:
| I think the author may be misunderstanding what practitioners
| mean when they say that "evals are all you need", because the
| term is overloaded.
|
| There are generic evals (like MMLU), for which the proper term is
| really "benchmark".
|
| But task-related evals, where you evaluate how a specific
| model/implementation performs on a task in your project, are, if
| not _all_ you need, at the very least the most important component
| by a wide margin. They do not _guarantee_ software performance the
| way unit tests do for traditional software, but they are the main
| mechanism we have for evolving a system towards being good enough
| to use in production. I am not aware of any workable alternative.
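|
| For illustration, a task-related eval can be as small as a set of
| representative inputs plus a judgement of each output, re-run
| after every model or prompt change (a rough sketch; the stub
| below stands in for the real model/pipeline call):
|
|     # Minimal task-specific eval harness (illustrative sketch).
|     # run_system is a stub standing in for the real model call.
|     def run_system(prompt: str) -> str:
|         return "42" if "answer" in prompt else ""
|
|     EVAL_CASES = [
|         # (input, judge) pairs; a judge returns True when the
|         # output is acceptable for this task.
|         ("What is the answer to everything?",
|          lambda out: "42" in out),
|         ("Reply with an empty string.",
|          lambda out: out.strip() == ""),
|     ]
|
|     def run_evals() -> float:
|         results = [j(run_system(p)) for p, j in EVAL_CASES]
|         print(f"{sum(results)}/{len(results)} cases passed")
|         return sum(results) / len(results)
|
|     if __name__ == "__main__":
|         # Track the score across model/prompt changes rather than
|         # treating it as a pass/fail gate like a unit test suite.
|         run_evals()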
| phillipcarter wrote:
| I think the most important part of this article is how people
| focus too much on evaluating interactions with the model, but not
| evaluating the whole system that enables a feature or workflow
| that uses an LLM.
|
| This is totally true! And I've talked with people who concluded
| that "the LLM is the problem", only to discover that the upstream
| calls to services producing the data fed into the LLM were
| actually what was causing the problems.
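|
| Concretely, an eval over the whole system can record whether the
| upstream data was good alongside the final answer, so failures can
| be attributed to the right stage (a rough sketch; the function
| names below are illustrative stand-ins for real pipeline stages):
|
|     # End-to-end eval sketch that also checks the upstream data
|     # fed to the model. fetch_context and generate_answer are
|     # hypothetical stand-ins for the retrieval step and LLM call.
|     def fetch_context(query: str) -> list[str]:
|         return ["Paris is the capital of France."]
|
|     def generate_answer(query: str, ctx: list[str]) -> str:
|         ok = any("Paris" in c for c in ctx)
|         return "Paris" if ok else "I don't know"
|
|     def eval_case(query: str, must_contain: str) -> dict:
|         ctx = fetch_context(query)
|         answer = generate_answer(query, ctx)
|         return {
|             # Did the upstream step surface the needed fact?
|             "context_ok": any(must_contain in c for c in ctx),
|             # Did the model produce an acceptable final answer?
|             "answer_ok": must_contain in answer,
|         }
|
|     if __name__ == "__main__":
|         print(eval_case("Capital of France?", "Paris"))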
| shahules wrote:
| It's an interesting article and I agree with some of the points
| you brought up here. But here are a few I don't agree with:
|
| 1. "Evals" is used throughout the article in the sense of LLM
| benchmarking, but that is not the point. One can effectively
| evaluate any AI system by building custom evals.
|
| 2. The purpose of evals is to help devs systematically improve
| their AI systems (at least that's how we look at it), not any of
| the purposes listed in your article. It's not a one-time thing;
| it's a practice, like the scientific method.
| groodt wrote:
| The article doesn't provide any alternatives?
|
| I think there are indeed many challenges when evaluating Compound
| AI Systems (http://bair.berkeley.edu/blog/2024/02/18/compound-ai-
| systems...)
|
| But evals in complex systems are the best we have at the moment.
| It's a "best-practice" just like all the forms of testing in the
| "test pyramid" (https://martinfowler.com/articles/practical-test-
| pyramid.htm...)
|
| Nothing is a silver bullet. Just hard-won, ideally automated,
| integrated quality and verification checks, built deep into the
| system and the SDLC.
___________________________________________________________________
(page generated 2025-03-03 23:01 UTC)