[HN Gopher] Show HN: PromptTools - open-source tools for evaluat...
       ___________________________________________________________________
        
       Show HN: PromptTools - open-source tools for evaluating LLMs and
       vector DBs
        
        Hey HN! We're Kevin and Steve. We're building PromptTools
        (https://github.com/hegelai/prompttools): open-source, self-
        hostable tools for experimenting with, testing, and evaluating
        LLMs, vector databases, and prompts.

        Evaluating prompts, LLMs, and vector databases is a painful,
        time-consuming but necessary part of the product engineering
        process. Our tools allow engineers to do this in a lot less time.

        By "evaluating" we mean checking the quality of a model's
        response for a given use case, which is a combination of testing
        and benchmarking. As examples:

        - For generated JSON, SQL, or Python, you can check that the
          output is actually JSON, SQL, or executable Python.

        - For generated emails, you can use another model to assess the
          quality of the generated email given some requirements, like
          whether or not the email is written professionally.

        - For a question-answering chatbot, you can check that the
          actual answer is semantically similar to an expected answer.
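
        To make the first kind of check concrete, here is a minimal
        sketch of structured output validation. These helpers are
        illustrative only, not PromptTools' built-in validators:

            import ast
            import json

            def is_valid_json(output: str) -> bool:
                """Does the model's output parse as JSON?"""
                try:
                    json.loads(output)
                    return True
                except json.JSONDecodeError:
                    return False

            def is_valid_python(output: str) -> bool:
                """Does the output at least parse as Python source?"""
                try:
                    ast.parse(output)
                    return True
                except SyntaxError:
                    return False
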
        At Google, Steve worked with HuggingFace and Lightning to support
        running the newest open-source models on TPUs. He realized that
        while the open-source community was contributing incredibly
        powerful models, it wasn't so easy to discover and evaluate them.
        It wasn't clear when you could use Llama or Falcon instead of
        GPT-4. We began looking for ways to simplify and scale this
        evaluation process.

        With PromptTools, you can write a short Python script (as short
        as 5 lines) to run such checks across models, parameters, and
        prompts, and pass the results into an evaluation function to get
        scores. All of this can be executed on your local machine without
        sending data to third parties. Then we help you turn those
        experiments into unit tests and CI/CD that track your model's
        performance over time.
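
        As a rough sketch of that pattern (written against the OpenAI
        client directly rather than PromptTools' own API; the model
        names, prompts, and scoring function below are placeholders):

            from itertools import product
            from openai import OpenAI

            # assumes the openai>=1.0 Python client and OPENAI_API_KEY set
            client = OpenAI()

            models = ["gpt-3.5-turbo", "gpt-4"]
            prompts = [
                "Write a SQL query that counts users by country.",
                "Write a SQL query that returns the ten newest users.",
            ]

            def looks_like_sql(text: str) -> float:
                # toy evaluation function: 1.0 if the output resembles SQL
                return 1.0 if "select" in text.lower() else 0.0

            results = []
            for model, prompt in product(models, prompts):
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                )
                output = response.choices[0].message.content
                results.append({
                    "model": model,
                    "prompt": prompt,
                    "score": looks_like_sql(output),
                })

        In a CI setup, a test could then assert that the average score
        for each model/prompt pair stays above a chosen threshold.
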
        Today we support all of the major model providers like OpenAI,
        Anthropic, Google, HuggingFace, and even LlamaCpp, and vector
        databases like ChromaDB and Weaviate. You can evaluate responses
        via semantic similarity, auto-evaluation by a language model, or
        structured output validations like JSON and Python. We even have
        a notebook UI for recording manual feedback.
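
        For the semantic-similarity mode, the idea is to embed the actual
        and expected answers and compare them. A minimal sketch, using
        sentence-transformers as one possible embedding backend (not
        necessarily what PromptTools uses under the hood):

            from sentence_transformers import SentenceTransformer, util

            embedder = SentenceTransformer("all-MiniLM-L6-v2")

            def semantic_similarity(actual: str, expected: str) -> float:
                # cosine similarity of the two embeddings; close to 1.0
                # for paraphrases, lower for unrelated answers
                vectors = embedder.encode(
                    [actual, expected], convert_to_tensor=True
                )
                return util.cos_sim(vectors[0], vectors[1]).item()

            score = semantic_similarity(
                "The capital of France is Paris.",
                "Paris is France's capital city.",
            )
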
        Quickstart:

            pip install prompttools
            git clone https://github.com/hegelai/prompttools.git
            cd prompttools && jupyter notebook examples/notebooks/OpenAIChatExperiment.ipynb
        For detailed instructions, see our documentation at
        https://prompttools.readthedocs.io/en/latest/.

        We also have a playground UI, built in Streamlit, which is
        currently in beta:
        https://github.com/hegelai/prompttools/tree/main/prompttools....

        Launch it with:

            pip install prompttools
            git clone https://github.com/hegelai/prompttools.git
            cd prompttools && streamlit run prompttools/ui/playground.py
       We'd love it if you tried our product out and let us know what you
       think! We just got started a month ago and we're eager to get
       feedback and keep building.
        
       Author : krawfy
       Score  : 142 points
       Date   : 2023-08-01 16:23 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | esafak wrote:
       | I'd like to see support for qdrant.
        
         | krawfy wrote:
         | We've actually been in contact with the qdrant team about
         | adding it to our roadmap! Andre (CEO) was asking for an
         | integration. If you want to work on the PR, we'd be happy to
         | work with you and get that merged in
        
       | mmaia wrote:
        | I like that it's not limited to single prompts and also allows
        | chat messages. It would be great if `OpenAIChatExperiment` could
        | also handle OpenAI's function calling.
        
       | pk19238 wrote:
       | This is super cool man!
        
         | nivekt wrote:
         | Thanks! We would appreciate any feedback or feature requests!
        
       | neelm wrote:
       | Something like this is going to be needed to evaluate models
       | effectively. Evaluation should be integrated into automated
       | pipelines/workflows that can scale across models and datasets.
        
         | krawfy wrote:
         | Thanks Neel! We totally agree that automated evals will become
         | an essential part of production LLM systems.
        
       | fatso784 wrote:
        | I like the support for Vector DBs and LLaMa-2. I'm curious what
        | influences shaped PromptTools and how it differs from other
        | tools in this space. For context, we've also
       | released a prompt engineering IDE, ChainForge, which is open-
       | source and has many of the features here, such as querying
       | multiple models at once, prompt templating, evaluating responses
       | with Python/JS code and LLM scorers, plotting responses, etc
       | (https://github.com/ianarawjo/ChainForge and a playground at
       | http://chainforge.ai).
       | 
       | One big problem we're seeing in this space is over-trust in LLM
       | scorers as 'evaluators'. I've personally seen that minor tweaks
       | to a scoring prompt can sometimes result in vastly different
       | evaluation 'results.' Given recent debacles
       | (https://news.ycombinator.com/item?id=36370685), I'm wondering
       | how we can design LLMOps tools for evaluation which both support
       | the use of LLMs as scorers, but also caution users about their
       | results. Are you thinking similarly about this question, or seen
       | usability testing which points to over-trust in 'auto-evaluators'
       | as an emerging problem?
        
         | hashemalsaket wrote:
         | One approach we've been working on is having multiple LLMs
         | score each other. Here is the design with an example of how
         | that works: https://github.com/HashemAlsaket/prompttools/pull/1
         | 
         | In short: Pick top 50% responses, LLMs score each other, repeat
         | until top response remains
        
           | fatso784 wrote:
           | What does 'top 50%' responses mean here, though? You'd need
           | to have a ground truth of how 'good' each score was to
           | calculate that --and if you had ground truth, no need to use
           | an LLM evaluator to begin with.
           | 
           | If you mean trusting the LLM scores to pick the 50% 'top'
           | responses they grade, this doesn't get around the issue of
           | overly trusting the LLM's scores.
        
             | hashemalsaket wrote:
             | For now, the design is basic:
             | 
             | User to LLM: "Rate this response to the following prompt on
             | a scale of 1-10, where 1 is a poor response and 10 is a
             | great response: [response]"
             | 
              | * LLM rates responses of all other LLMs
              | 
              | * All other LLMs do the same
             | 
             | Then we take the average score of each response. The LLMs
             | that produced the top 50% of responses will respond again
             | until one response with the highest score remains.
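              | 
              | As an illustrative sketch of that loop (not the actual PR
              | code; generate() and rate() below are hypothetical helpers
              | that call the underlying models):
              | 
              |     def tournament(prompt, models, generate, rate):
              |         while True:
              |             resp = {m: generate(m, prompt) for m in models}
              |             if len(models) == 1:
              |                 return resp[models[0]]
              |             # each response is scored 1-10 by every
              |             # *other* model, then averaged
              |             avg = {
              |                 m: sum(rate(j, resp[m], prompt)
              |                        for j in models if j != m)
              |                 / (len(models) - 1)
              |                 for m in models
              |             }
              |             # the models behind the top 50% of responses
              |             # respond again
              |             ranked = sorted(models, key=avg.get,
              |                             reverse=True)
              |             models = ranked[: max(1, len(models) // 2)]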
        
         | krawfy wrote:
         | Great question, chainforge looks interesting!
         | 
         | We offer auto-evals as one tool in the toolbox. We also
         | consider structured output validations, semantic similarity to
         | an expected result, and manual feedback gathering. If anything,
         | I've seen that people are more skeptical of LLM auto-eval
         | because of the inherent circularity, rather than over-trusting
         | it.
         | 
         | Do you have any suggestions for other evaluation methods we
         | should add? We just got started in July and we're eager to
         | incorporate feedback and keep building.
        
           | fatso784 wrote:
            | Thanks for the clarification! Yes, I see now that auto-eval
            | here is more AI-agent-ish than a one-shot approach. It still
            | has the trust issue, though.
           | 
           | For suggestions, one thing I'm curious about is how we can
           | have out-of-the-box benchmark datasets and do this
           | responsibly. ChainForge supports most OpenAI evals, but from
           | adding this we realized the quality of OpenAI Evals is really
           | _sketchy_... duplicate data, questionable metrics, etc.
           | OpenAI has shown that trusting the community to make
           | benchmarks is perhaps not a good idea; we should instead make
           | it easier for scientists/engineers to upload their benchmarks
           | and make it easier for others to run them. That's one
           | thought, anyway.
        
       | catlover76 wrote:
       | Super cool, the need for tooling like this is something one
       | realizes pretty quickly when starting to build apps that leverage
       | LLMs.
        
         | krawfy wrote:
         | Glad you think so, we agree! If you end up trying it out, we'd
         | love to hear what you think, and what other features you'd like
         | to see.
        
       | tikkun wrote:
       | This looks great, thanks
       | 
       | See also this related tool:
       | https://news.ycombinator.com/item?id=36907074
        
         | krawfy wrote:
         | Awesome! Let us know if there's anything from that tool that
         | you think we should add to PromptTools
        
       | politelemon wrote:
       | Similar tool I was about to look at:
       | https://github.com/promptfoo/promptfoo
       | 
       | I've seen this in both tools but I wasn't able to understand: In
       | the screenshot with feedback, I see thumbs up and thumbs down
       | options. Where do those values go, what's the purpose? Does it
       | get preserved across runs? It's just not clicking in my head.
        
         | krawfy wrote:
         | For now, we just aggregate those across the models / prompts /
         | templates you're evaluating so that you can get an aggregate
         | score. You can export to CSV, JSON, MongoDB, or Markdown files,
         | and we're working on more persistence features so that you can
         | get a history of which models / prompts / templates you gave
         | the best scores to, and keep track of your manual evaluations
         | over time.
        
       | 8awake wrote:
       | Great work! We will make use of that with https://www.formula8.ai
        
         | nivekt wrote:
         | Thank you! If you have any feedback or feature requests, don't
         | hesitate to reach out.
        
       ___________________________________________________________________
       (page generated 2023-08-01 23:01 UTC)