[HN Gopher] Evals are not all you need
       ___________________________________________________________________
        
       Evals are not all you need
        
       Author : amarble
       Score  : 32 points
       Date   : 2025-03-03 18:09 UTC (4 hours ago)
        
 (HTM) web link (www.marble.onl)
 (TXT) w3m dump (www.marble.onl)
        
       | iLoveOncall wrote:
        | There are so many "if you do that then it's not good" caveats
        | in this article that it actually seems like evals can in fact
        | be all you need, as long as you do them right?
       | 
        | I build an LLM-based platform at work, with a lot of agents
        | and data sources, and yet we still don't fall under any of
        | those "ifs".
        
       | tcdent wrote:
       | This conversation always results in a ":shrug: I guess we'll
       | never know" at the end.
       | 
        | There may never be a silver-bullet approach to this, or one
        | that satisfies our need for determinism the way unit testing
        | does, but we can still try.
       | 
        | Would love to see as much open-source framework effort put
        | into this as is being put into agentic workflows.
        
       | intellectronica wrote:
       | I think the author may be misunderstanding what practitioners
       | mean when they say that "evals are all you need", because the
       | term is overloaded.
       | 
       | There are generic evals (like MMLU), for which the proper term is
       | really "benchmark".
       | 
       | But task-related evals, where you evaluate how a specific
       | model/implementation is doing on performing a task in your
       | project are, if not _all_ you need, at the very least the most
       | important component by a wide margin. They do not _guarantee_
       | software performance in the same way that unit tests do for
       | traditional software. But they are the main mechanism we have for
        | evolving a system towards being good enough to use in
        | production. I am not aware of any workable alternative.
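
        A task-related eval of this kind can be sketched as a small set
        of labeled task cases plus a scoring loop; `call_model`, the
        cases, and the scoring rule below are hypothetical stand-ins
        for the model/pipeline and dataset under test:

```python
# Minimal sketch of a task-related eval: a small labeled set of task
# cases, a scoring function, and an aggregate pass rate.
# `call_model` is a hypothetical stand-in for your model/pipeline call.

def call_model(prompt: str) -> str:
    # Placeholder: in practice this would call your LLM or pipeline.
    return "positive" if "love" in prompt else "negative"

# Task-specific cases with expected outputs (the "eval set").
EVAL_CASES = [
    {"input": "I love this product", "expected": "positive"},
    {"input": "This is terrible", "expected": "negative"},
]

def score(output: str, expected: str) -> float:
    # Simple exact-match scoring; real evals often use rubric-based
    # or LLM-as-judge scoring instead.
    return 1.0 if output.strip().lower() == expected else 0.0

def run_evals(cases) -> float:
    # Run every case through the model and return the mean score.
    results = [score(call_model(c["input"]), c["expected"]) for c in cases]
    return sum(results) / len(results)

if __name__ == "__main__":
    print(f"pass rate: {run_evals(EVAL_CASES):.0%}")
```

        Tracking that aggregate pass rate across changes is what lets a
        system evolve towards production quality.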
        
       | phillipcarter wrote:
       | I think the most important part of this article is how people
       | focus too much on evaluating interactions with the model, but not
       | evaluating the whole system that enables a feature or workflow
       | that uses an LLM.
       | 
        | This is totally true! And I've talked with people who
        | concluded that "the LLM is the problem", only to discover that
        | the upstream services producing the data fed into the LLM were
        | actually the ones causing problems.
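
        One way to sketch a whole-system eval is to record each stage's
        output in a trace and score the stages separately, so a failure
        can be attributed to the upstream data step rather than the
        model; all names below are hypothetical placeholders:

```python
# Sketch of evaluating the whole pipeline, not just the model call.
# Each stage's output is recorded in a trace so failures can be
# attributed to the right stage. All names are hypothetical.

def retrieve_context(query: str) -> str:
    # Stand-in for an upstream service that feeds data to the LLM.
    return "shipping takes 3-5 days" if "shipping" in query else ""

def call_llm(query: str, context: str) -> str:
    # Stand-in for the model call; echoes the context when present.
    return context if context else "I don't know"

def run_pipeline(query: str) -> dict:
    # Record every intermediate output, not just the final answer.
    trace = {"query": query}
    trace["context"] = retrieve_context(query)
    trace["answer"] = call_llm(query, trace["context"])
    return trace

def evaluate(trace: dict, expected_substring: str) -> dict:
    # Score both stages: did retrieval return anything, and does the
    # final answer contain the expected content?
    return {
        "retrieval_ok": bool(trace["context"]),
        "answer_ok": expected_substring in trace["answer"],
    }

if __name__ == "__main__":
    trace = run_pipeline("how long does shipping take?")
    print(evaluate(trace, "3-5 days"))
```

        When `retrieval_ok` fails but the model is fine, the problem is
        upstream of the LLM, which is exactly the case described above.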
        
       | shahules wrote:
        | It's an interesting article and I agree with some of the
        | points you brought up here. But here are a few that I don't
        | agree with:
       | 
        | 1. Evals are used throughout the article in the sense of LLM
        | benchmarking, but that misses the point. One can effectively
        | evaluate any AI system by building custom evals.
       | 
        | 2. The purpose of evals is to help devs systematically improve
        | their AI systems (at least that's how we look at it), not any
        | of the purposes listed in your article. It's not a one-time
        | thing; it's a practice, like the scientific method.
        
       | groodt wrote:
       | The article doesn't provide any alternatives?
       | 
       | I think there are indeed many challenges when evaluating Compound
       | AI Systems (http://bair.berkeley.edu/blog/2024/02/18/compound-ai-
       | systems...)
       | 
       | But evals in complex systems are the best we have at the moment.
       | It's a "best-practice" just like all the forms of testing in the
       | "test pyramid" (https://martinfowler.com/articles/practical-test-
       | pyramid.htm...)
       | 
       | Nothing is a silver bullet. Just hard won, ideally automated,
       | integrated quality and verification checks, built deep into the
       | system and SDLC.
        
       ___________________________________________________________________
       (page generated 2025-03-03 23:01 UTC)