[HN Gopher] Show HN: Paramount - Human Evals of AI Customer Support
       ___________________________________________________________________
        
       Show HN: Paramount - Human Evals of AI Customer Support
        
       Hey HN, Hakim here from Fini (YC S22), a startup that builds
       automated customer support bots for enterprises with a high
       volume of support requests.
        
       Today, one of the largest use cases of LLMs is automating
       customer support. As the space has evolved over the past year,
       so has the need to evaluate LLM outputs, and a sea of LLM evals
       packages has been released. "LLM evals" refers to the evaluation
       of large language models: assessing how well these AI systems
       understand and generate human-like text. These packages have
       mostly relied on "automatic evals," where an algorithm (usually
       another LLM) tests and scores AI responses without human
       intervention.
        
       In our day-to-day work, we have found that automatic evals are
       not enough to reach the 95% accuracy our enterprise customers
       require. They are efficient, but they often miss nuances that
       only human expertise can catch, and they can never replace the
       feedback of a trained human who is deeply knowledgeable about an
       organization's latest product releases, knowledge base,
       policies, and support issues. The key to solving this is to stop
       ignoring the business side of the problem and start involving
       those experts in the evaluation process.
        
       That is why we are releasing Paramount, an open source package
       that incorporates human feedback directly into the evaluation
       process. By simplifying the step of gathering feedback,
       Paramount lets ML engineers pinpoint and fix accuracy issues
       (prompts, knowledge base gaps) much faster. It provides a
       framework for recording LLM function outputs (ground truth data)
       and facilitates human agent evaluations through a simple UI,
       reducing the time it takes to identify and correct errors.
        
       Developers integrate Paramount with a Python decorator that logs
       LLM interactions to a database; human experts then review those
       interactions in a straightforward UI, as sketched below. This
       process supports the debugging and validation phase of launching
       accurate support bots.
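        
       Roughly, the integration looks like the following minimal
       sketch. The decorator usage, its logging behavior, and the
       OpenAI call are assumptions for illustration based on the
       description above, not necessarily Paramount's exact API.
        
       # Minimal sketch of the flow described above. The decorator
       # name (`paramount.record`) and its behavior are assumed for
       # illustration; see the repo for the real API.
       import openai
       import paramount
        
       @paramount.record  # assumed: logs inputs and output to a database
       def answer_ticket(question: str, context: str) -> str:
           """Draft a support answer grounded in the knowledge base."""
           response = openai.chat.completions.create(
               model="gpt-4o",
               messages=[
                   {"role": "system",
                    "content": f"Answer using this context:\n{context}"},
                   {"role": "user", "content": question},
               ],
           )
           return response.choices[0].message.content
        
       # Each call is recorded as ground truth data; support experts
       # later review and score the recorded answers in Paramount's UI.
       answer = answer_ticket(
           "How do I reset my password?",
           context="Password resets are done from Settings > Security.",
       )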
       We'd love to hear what you think!
        
       Author : hakimk
       Score  : 15 points
       Date   : 2024-06-13 18:20 UTC (4 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       ___________________________________________________________________
       (page generated 2024-06-13 23:00 UTC)