       Launch HN: Relari (YC W24) - Identify the root cause of problems in
       LLM apps
        
       Hi HN, we are the founders of Relari, the company behind
       continuous-eval (https://github.com/relari-ai/continuous-eval), an
       evaluation framework that lets you test your GenAI systems at the
       component level, pinpointing issues where they originate.

       We experienced the need for this when we were building a copilot
       for bankers. Our RAG pipeline blew up in complexity as we added
       components: a query classifier (to triage user intent), multiple
       retrievers (to grab information from different sources), a
       filtering LLM (to rerank / compress context), a calculator agent
       (to call financial functions) and finally the synthesizer LLM that
       gives the answer. Ensuring reliability became more difficult with
       each component we added.

       When a bad response was detected by our answer evaluator, we had
       to backtrack multiple steps to understand which component(s) made
       a mistake. But this quickly became unscalable beyond a few
       samples.

       I did my Ph.D. in fault detection for autonomous vehicles, and I
       see a strong parallel between the complexity of autonomous driving
       software and today's LLM pipelines. In self-driving systems,
       sensors, perception, prediction, planning, and control modules are
       all chained together. To ensure system-level safety, we use
       granular metrics to measure the performance of each module
       individually. When the vehicle makes an unexpected decision, we
       use these metrics to pinpoint the problem to a specific component.
       Only then can we make targeted improvements, systematically.

       Based on this thinking, we developed the first version of
       continuous-eval for ourselves. Since then we've made it more
       flexible to fit various types of GenAI pipelines. Continuous-eval
       allows you to describe (programmatically) your pipeline and
       modules, and select metrics for each module. We developed 30+
       metrics to cover retrieval, text generation, code generation,
       classification, agent tool use, etc. We now have a number of
       companies using us to test complex pipelines like finance
       copilots, enterprise search, coding agents, etc.

       As an example, one customer was trying to understand why their RAG
       system did poorly on trend analysis queries. Through
       continuous-eval, they realized that the "retriever" component was
       retrieving 80%+ of all relevant chunks, but the "reranker"
       component, which filters out "irrelevant" context, was dropping
       that to below 50%. This enabled them to fix the problem, in their
       case by skipping the reranker for certain queries.

       We've also built ensemble metrics that do a surprisingly good job
       of predicting user feedback. Users often rate LLM-generated
       answers by giving a thumbs up/down. We train our custom metrics on
       this user data, and then use those metrics to generate thumbs
       up/down ratings on future LLM answers. The results turn out to be
       90% aligned with what the users say. This gives developers a
       feedback loop from production data to offline testing and
       development. Some customers have found this to be our most
       distinctive advantage.

       Lastly, to make the most out of evaluation, you should use a
       diverse dataset--ideally with ground truth labels for
       comprehensive and consistent assessment. Because ground truth
       labels are costly and time-consuming to curate manually, we also
       have a synthetic data generation pipeline that allows you to get
       started quickly. Try it here:
       https://www.relari.ai/#synthetic_data_demo

       What's been your experience testing and iterating on LLM apps?
       Please let us know your thoughts and feedback on our approaches
       (modular framework, leveraging user feedback, testing with
       synthetic data).
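
Editor's sketch: the retriever/reranker example in the post comes down to measuring recall after each pipeline stage against the same ground-truth chunks. The snippet below illustrates that idea in plain Python. It does not use the continuous-eval API; the function names, stage names, and chunk IDs are all made up for illustration.

```python
from typing import Dict, List, Set


def recall(retrieved: List[str], relevant: Set[str]) -> float:
    """Fraction of the relevant chunk IDs that appear in `retrieved`."""
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant.intersection(retrieved)) / len(relevant)


def stage_recalls(
    stage_outputs: Dict[str, List[str]], relevant: Set[str]
) -> Dict[str, float]:
    """Recall of each pipeline stage against the same ground truth."""
    return {
        stage: recall(chunks, relevant)
        for stage, chunks in stage_outputs.items()
    }


# Hypothetical sample: chunk IDs are invented for this example.
relevant_chunks = {"c1", "c2", "c3", "c4", "c5"}
outputs = {
    "retriever": ["c1", "c2", "c3", "c4", "c9", "c10"],  # finds 4 of 5
    "reranker": ["c1", "c2", "c9"],  # keeps only 2 of 5
}

scores = stage_recalls(outputs, relevant_chunks)
for stage, value in scores.items():
    print(f"{stage}: recall = {value:.2f}")
```

On this toy sample the retriever scores 0.80 and the reranker 0.40, the same shape of drop the customer saw. An end-to-end answer metric alone would not show which stage lost the context.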
        
       Author : antonap
       Score  : 70 points
       Date   : 2024-03-08 14:00 UTC (9 hours ago)
        
       | atomon wrote:
       | This looks very cool, will try it out on my next project.
       | 
       | There have been a number of solutions popping up to address this
       | problem, and I think the need is very real. Decomposing these LLM
       | tasks into subtasks seems to be one of the best ways to work
       | around the shortcomings of LLMs in production apps
       | (hallucinations, context window limits, etc). But then you end up
       | with complicated pipelines that are difficult to debug, improve,
       | reason about, etc.
        
         | antonap wrote:
         | Indeed - decomposition improves reliability but also makes the
         | testing more challenging. That's why we made the framework
         | modular! Let us know of any feedback as you try it out!
        
       | lancehasson wrote:
       | Looks very cool! Will check it out later. FYI - search on the
       | docs isn't loading on Safari for me
        
         | antonap wrote:
         | Thank you for catching that! Looking into it now.
        
       | petervandijck wrote:
       | Love that you're tackling this and congrats on the launch.
       | 
       | Feedback: the synthetic data demo shows nicely what that piece
       | does, but the page is really messy, it would be nice to have that
       | demo cleanly on a dedicated page:
       | https://www.relari.ai/#synthetic_data_demo
        
         | antonap wrote:
         | Thank you for the feedback! That's a great suggestion. We do
         | want to make the demo into a separate page, and also add a live
         | evaluation demo using the synthetic data generated on the fly.
        
       | nextworddev wrote:
       | Ok this is entering the same space as Arize AI which I have been
       | using for a year. What's the main benefit of this product?
        
         | antonap wrote:
         | Arize is a great tool for observability, and their open source
         | product, Phoenix, offers many great features for LLM evaluation
         | as well.
         | 
         | Some key unique advantages we offer:
         | 
         | - Component-level evaluation, not just observability: Many
         | great tools on the market can help you observe different
         | components (or execution steps) in a GenAI system for each data
         | sample. What we offer on top of that is the ability to do
         | automatic evaluation and have metrics for each step of the
         | pipeline. For example, you will be able to have metrics on the
         | accuracy of agent tool usage, precision / recall for each
         | retriever step, and relevant metrics on each LLM call.
         | 
         | - Leverage user feedback for offline evaluation: We allow you
         | to create custom metrics based on your past user feedback data.
         | Unlike predefined metrics, these custom metrics are trained to
         | learn your specific user preferences. In a sense, these metrics
         | simulate user ratings.
         | 
         | - Synthetic Data Generation: Large amounts of synthetic data
         | can help you stress test your AI system beyond your existing
         | data. They also come in greater granularity than human-curated
         | datasets and can help you test and validate.
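
Editor's sketch: one simple way to picture "metrics that simulate user ratings" is to fit a cutoff on an automatic score so that it reproduces past thumbs up/down, then check agreement on held-out feedback. This is a deliberately simplified stand-in, not Relari's ensemble method; the function names, scores, and ratings below are all hypothetical.

```python
from typing import List, Tuple


def agreement_rate(
    scores: List[float], thumbs_up: List[bool], cutoff: float
) -> float:
    """Share of samples where (score >= cutoff) matches the user rating."""
    predicted = [s >= cutoff for s in scores]
    matches = sum(p == t for p, t in zip(predicted, thumbs_up))
    return matches / len(scores)


def best_threshold(
    scores: List[float], thumbs_up: List[bool]
) -> Tuple[float, float]:
    """Pick the cutoff that best reproduces the recorded ratings."""
    best_cutoff, best_agreement = 0.0, -1.0
    for cutoff in sorted(set(scores)):
        agreement = agreement_rate(scores, thumbs_up, cutoff)
        if agreement > best_agreement:
            best_cutoff, best_agreement = cutoff, agreement
    return best_cutoff, best_agreement


# Hypothetical past feedback: automatic score and thumbs up (True) / down.
train_scores = [0.2, 0.4, 0.5, 0.7, 0.9, 0.3]
train_thumbs = [False, False, True, True, True, False]
cutoff, train_agreement = best_threshold(train_scores, train_thumbs)

# Hypothetical held-out feedback, used to check how well the cutoff
# generalizes to ratings it was not fitted on.
test_scores = [0.6, 0.1, 0.8, 0.45]
test_thumbs = [True, False, False, False]
held_out = agreement_rate(test_scores, test_thumbs, cutoff)

print(f"cutoff={cutoff}, train={train_agreement:.2f}, held-out={held_out:.2f}")
```

The held-out agreement is the number to watch: a metric that only matches the feedback it was fitted on says little about future answers.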
        
           | esafak wrote:
           | I always recommend a comparison page. Help prospects decide.
        
             | antonap wrote:
             | Great suggestion, thanks!
        
       | swyx wrote:
       | qtn: why Launch so early? why not Show first?
        
         | antonap wrote:
         | Originally, we were going to do a Show HN for the modular
         | evaluation and another Show HN for the synthetic data, because
         | our understanding is that the Show HNs are for individual
         | projects. But then we realized that it's the combination of the
         | various pieces that brings the most value, so we decided to put
         | them together as a single Launch HN instead.
        
       ___________________________________________________________________
       (page generated 2024-03-08 23:00 UTC)