[HN Gopher] Show HN: Ragas - Open-source library for evaluating RAG pipelines
       ___________________________________________________________________
        
       Show HN: Ragas - Open-source library for evaluating RAG pipelines
        
        Ragas is an open-source library for evaluating and testing RAG
        and other LLM applications. GitHub:
        https://github.com/explodinggradients/ragas, docs:
        https://docs.ragas.io/. Ragas provides you with different sets
        of metrics and methods, like synthetic test data generation, to
        help you evaluate your RAG applications. Ragas started off last
        year as a way to scratch our own itch while evaluating our RAG
        chatbots.
        
        _Problems Ragas can solve_
        
        - How do you choose the best components for your RAG, such as
        the retriever, reranker, and LLM?
        
        - How do you formulate a test dataset without spending tons of
        money and time?
        
        We believe there needs to be an open-source standard for
        evaluating and testing LLM applications, and our vision is to
        build it for the community. We are tackling this challenge by
        evolving ideas from the traditional ML lifecycle for LLM
        applications.
        _ML Testing Evolved for LLM Applications_
        
        We built Ragas on the principles of metrics-driven development,
        and we aim to develop and innovate on techniques inspired by
        state-of-the-art research to solve the problems of evaluating
        and testing LLM applications. We don't believe these problems
        can be solved by building a fancy tracing tool; rather, we want
        to solve them from a layer under the stack. For this, we are
        introducing methods like automated synthetic test data
        curation, metrics, and feedback utilisation, inspired by
        lessons learned from deploying stochastic models in our careers
        as ML engineers. While currently focused on RAG pipelines, our
        goal is to extend Ragas to testing a wide array of compound
        systems, including those based on RAGs, agentic workflows, and
        various transformations.
        
        Try out Ragas in Google Colab:
        https://colab.research.google.com/github/shahules786/openai-...
        Read our docs to learn more: https://docs.ragas.io/
        
        We would love to hear feedback from the HN community :)
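        
        For a flavour of the workflow, here is a minimal sketch of an
        evaluation run using the 0.1-era API from our docs (metric
        names and dataset columns follow the documentation; treat it as
        a sketch rather than a canonical example, and note it expects
        an OPENAI_API_KEY by default):
        
            from datasets import Dataset
            from ragas import evaluate
            from ragas.metrics import faithfulness, answer_relevancy
        
            # One row per question: the generated answer plus the
            # contexts your retriever returned for it.
            data = {
                "question": ["When was the Eiffel Tower built?"],
                "answer": ["It was completed in 1889."],
                "contexts": [["The Eiffel Tower was completed in 1889."]],
                "ground_truth": ["The Eiffel Tower was completed in 1889."],
            }
            dataset = Dataset.from_dict(data)
        
            # Scores each row with LLM-based metrics (OpenAI by default).
            result = evaluate(dataset, metrics=[faithfulness, answer_relevancy])
            print(result)            # e.g. {'faithfulness': 1.0, ...}
            df = result.to_pandas()  # per-row scores for inspection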
        
       Author : shahules
       Score  : 58 points
       Date   : 2024-03-21 15:48 UTC (7 hours ago)
        
       | dataexporter wrote:
        | Based on our initial analysis with Ragas a few months ago, it
        | didn't provide the results our team was expecting, and it
        | required a lot of customisation on top. Nevertheless, it's a
        | pretty solid library.
        
         | shahules wrote:
          | Hey, thanks for trying out Ragas. As an open-source
          | library, we are continuously improving based on feedback
          | from the community, which I see as our primary strength.
          | I'm sure Ragas isn't perfect yet, but I can assure you it
          | is 10x better than it was a few months ago.
        
       | swyx wrote:
       | congrats on launching! i think my continuing struggle with
       | looking at Ragas as a company/library rather than a very
       | successful mental model is that the core of it is like 8 metrics
       | (https://github.com/explodinggradients/ragas/tree/main/src/ra...)
       | that are each 1-200 LOC. i can inline that easily in my app and
       | retain full control, or model that in langchain or haystack or
       | whatever.
       | 
       | why is Ragas a library and a company, rather than an overall
       | "standard" or philosophy (eg like Heroku's 12 Factor Apps) that
       | could maybe be more universally adopted without using the
       | library?
       | 
       | (just giving an opp to pitch some underappreciated benefits of
       | using this library)
        
         | shahules wrote:
         | Thank you for asking this question.
         | 
         | To answer this question, I will explain two directions of
         | Ragas.
         | 
         | The first one is the horizontal expansion of the library which
         | involves features like
         | 
         | - Giving you the ability to use any LLMs instantly without any
         | hassle
         | 
         | - Asynchronous evaluations, integrations with tracing tools,
         | etc
         | 
         | - Automatic support to adapt metrics to any language
         | 
         | The second is vertical expansion or adding more core features
         | like metrics to Ragas which includes.
         | 
         | - Synthetic test data generation: this is something that is
         | heavily loved by our community so we are continuously improving
         | the quality of it.
         | https://docs.ragas.io/en/stable/concepts/testset_generation....
         | 
         | Now, as we expand in both directions we aim to solve the
         | problem of how to evaluate and test compound systems. Now, to
         | solve this we will be innovating and working on features like
         | feedback utilization, automatically synthesizing assertions,
         | etc to solve this hard problem.
         | 
         | I hope I was able to answer your question. Would love to
         | discuss more.
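          | 
          | Roughly, the test set generation flow looks like this with
          | the current 0.1-era API (exact names may shift between
          | versions, so treat this as a sketch; the document loader
          | and corpus path are just illustrative):
          | 
          |     from langchain_community.document_loaders import DirectoryLoader
          |     from ragas.testset.generator import TestsetGenerator
          |     from ragas.testset.evolutions import simple, reasoning, multi_context
          | 
          |     # Load your corpus as LangChain documents.
          |     documents = DirectoryLoader("docs/", glob="**/*.md").load()
          | 
          |     # Uses OpenAI models for generation and critique by default.
          |     generator = TestsetGenerator.with_openai()
          |     testset = generator.generate_with_langchain_docs(
          |         documents,
          |         test_size=10,
          |         distributions={simple: 0.5, reasoning: 0.25,
          |                        multi_context: 0.25},
          |     )
          |     print(testset.to_pandas().head())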
        
           | swyx wrote:
           | cool cool. so 1) will be a direct langchain competitor, and
           | 2) is net new territory?
        
             | shahules wrote:
              | 1) Horizontal expansion and support are core to every
              | framework/library. This won't make us a competitor to
              | LC; we actually use langchain-core to support many of
              | these things, like supporting different LLMs. 2) We
              | operate in a layer underneath the evals and testing
              | stack because we want to solve the problem from the
              | ground up rather than building a fancy tracing tool,
              | which comes later in the stack.
        
       | retrovrv wrote:
       | Phenomenal to see how Ragas has progressed. Congratulations on
       | the launch
        
         | shahules wrote:
         | Thank you.
        
       | jfisher4024 wrote:
       | Congratulations on the launch! I was unable to use this library:
       | I was trying to evaluate different non-OpenAI models and it
       | consistently failed due to malformed JSONs coming from the model.
       | 
       | Any thoughts about using different models? Is this just a
       | langchain limitation?
        
         | shahules wrote:
          | Thanks for your feedback. We have tested Ragas on
          | alternatives like Claude, Mixtral, Gemini, etc.
          | 
          | Although we support all LLMs supported by LangChain, sadly
          | many of the OSS models aren't capable of generating
          | JSON-formatted output out of the box, which is important
          | for us to ensure reproducibility.
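          | 
          | For reference, the "bring your own LLM" pattern looks
          | roughly like this (the Anthropic import path and model name
          | are illustrative and depend on your LangChain version;
          | `dataset` is a Ragas-style evaluation dataset as in our
          | docs):
          | 
          |     from langchain_anthropic import ChatAnthropic
          |     from ragas import evaluate
          |     from ragas.llms import LangchainLLMWrapper
          |     from ragas.metrics import faithfulness
          | 
          |     # Wrap any LangChain chat model so Ragas can drive it.
          |     claude = LangchainLLMWrapper(
          |         ChatAnthropic(model="claude-3-sonnet-20240229"))
          |     result = evaluate(dataset, metrics=[faithfulness],
          |                       llm=claude)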
        
       | nkko wrote:
       | Congratulations on the launch of Ragas! This looks like an
       | incredibly valuable tool for the LLM community. As the library
       | continues to evolve, it will be interesting to see how it adapts
       | to handle the growing diversity of LLM architectures and use
       | cases.
        
         | shahules wrote:
         | Yes, this is an interesting challenge we are also excited
         | about.
        
       | rhogar wrote:
        | Congratulations on the launch! Personally, I would love to
        | see rough estimates of the expected number of requests and
        | tokens required to run tasks like synthetic data generation
        | for different amounts of data. Though this is likely highly
        | variable, I would like a loose idea of the possible incurred
        | costs and execution time.
        
         | shahules wrote:
          | Hey, this is a highly requested feature, and we will be
          | implementing it soon. A rough estimate along those lines is
          | exactly what we are planning.
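          | 
          | Until then, a purely illustrative back-of-envelope estimate
          | (the helper, the assumed output size, and the prices are
          | all assumptions, not a Ragas feature):
          | 
          |     import tiktoken
          | 
          |     enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
          | 
          |     def estimate_cost(prompts, avg_output_tokens=300,
          |                       usd_per_1k_in=0.0005,
          |                       usd_per_1k_out=0.0015):
          |         # Rough USD cost for one batch of LLM calls:
          |         # count input tokens, assume a fixed output size.
          |         tokens_in = sum(len(enc.encode(p)) for p in prompts)
          |         tokens_out = avg_output_tokens * len(prompts)
          |         return (tokens_in / 1000 * usd_per_1k_in
          |                 + tokens_out / 1000 * usd_per_1k_out)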
        
       | pawanapg wrote:
       | Also check out DeepEval... our team has been using it for a
       | while, and it's been working well for us because we can evaluate
       | any LLMs, something this library doesn't seem to support
       | (https://github.com/confident-ai/deepeval).
        
         | shahules wrote:
         | Hey, DeepEval is interesting. What do you mean by "evaluating
         | any LLMs"?
        
       | redskyluan wrote:
       | Great product and great progress.
       | 
        | The first step in building RAG is always to evaluate.
       | 
        | Beyond all the current evaluations, cost and performance
        | should also be part of the evaluation.
        
       ___________________________________________________________________
       (page generated 2024-03-21 23:00 UTC)