[HN Gopher] Launch HN: Talc AI (YC S23) - Test Sets for AI
___________________________________________________________________
Launch HN: Talc AI (YC S23) - Test Sets for AI
Hey all! Max and Matt here from Talc AI. We do automated QA for
anything built on top of an LLM. Check out our demo:
https://talc.ai/demo

We've found that it's very difficult to know how well LLM
applications (and especially RAG systems) are going to work in the
wild. Many companies tackle this by having developers or contractors
run tests manually. It's a slow process that holds back development,
and often results in unexpected behavior when the application ships.

We've dealt with similar problems before; Max was a staff engineer
working on systematic technical solutions for privacy problems at
Facebook, and Matt worked on ML ops on Facebook's election integrity
team, helping run classifiers that handled trillions of data points.
We learned that even the best predictive systems need to be deeply
understood and trusted to be useful to product teams, and set out to
build the same understanding in AI.

To solve this, we take ideas from academia on how to benchmark the
general capabilities of language models, and apply them to
generating domain-specific test cases that run against your actual
prompts and code. Consider an analogy: if you're a lawyer, we don't
need to be lawyers to open up a legal textbook and test your
knowledge of its content. Similarly, if you're building a legal AI
application, we don't need to build your application to come up with
an effective set of tests that can benchmark your performance.

To make this more concrete: when you pick a topic in the demo, we
grab the associated Wikipedia page and extract a bunch of facts from
it using a classic NLP technique called "named entity recognition".
For example, if you picked FreeBASIC, we might extract the following
line from it:

    Source of truth: "IDEs specifically made for FreeBASIC include
    FBide and FbEdit,[5] while more graphical options include WinFBE
    Suite and VisualFBEditor."

This line is our source of truth. We then use an LLM to work
backwards from this fact into a question and answer:

    Question: "What programming language are the IDEs WinFBE Suite
    and FbEdit designed to support?"

    Reference Answer: "FreeBasic"

We can then evaluate accurately by comparing the reference answer
and the original source of truth - this is how we generate "simple"
questions in the demo. In production we're building this same
functionality on our customers' knowledge bases instead of
Wikipedia.
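
As a rough sketch, that "simple question" pipeline looks something
like the code below (spaCy and the OpenAI chat API are illustrative
stand-ins here; the production pipeline uses a mix of models and
differs in the details):

    import spacy
    from openai import OpenAI

    nlp = spacy.load("en_core_web_sm")  # classic NER pipeline
    client = OpenAI()

    def extract_facts(page_text: str) -> list[str]:
        # Keep sentences that mention named entities; these become
        # the candidate "sources of truth".
        doc = nlp(page_text)
        return [s.text.strip() for s in doc.sents if s.ents]

    def fact_to_test_case(fact: str) -> dict:
        # Work backwards from a fact to a question/reference answer.
        resp = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": "Turn this fact into one quiz question "
                           "and a short reference answer. Reply as "
                           "'Q: ...' then 'A: ...' on two lines.\n\n"
                           f"Fact: {fact}",
            }],
        )
        # Assumes the model followed the requested Q:/A: format.
        q, a = resp.choices[0].message.content.split("\nA:", 1)
        return {"source_of_truth": fact,
                "question": q.removeprefix("Q:").strip(),
                "reference_answer": a.strip()}

Running fact_to_test_case over each extracted fact is what produces
the "simple" question set you see in the demo.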

We then employ a few different strategies to generate questions -
these range from simple factual questions like "How much does the
2024 Chevy Tahoe cost?" to complex questions like "What would a
mechanic have to do to fix the recall on my 2018 Golf?" These
questions are based on facts extracted from your knowledge base and
real customer examples.

This testing and grading process is fast - it's driven by a mixture
of LLMs and traditional algorithms, and can turn around in minutes.

Our business model is pretty simple - we charge for each test
created. If you opt to use our grading product as well, we charge
for each example graded against the test.

We're excited to hear what the HN community thinks - please let us
know in the comments if you have any feedback, questions, or
concerns!
Author : maxrmk
Score : 76 points
Date : 2024-01-18 14:32 UTC (8 hours ago)
| sherlock_h wrote:
| Looks interesting. How do you rate the correctness? Some complex
| LLM answers seemed to be correct but not in as much detail as the
| expected answer.
|
| How do you generate the answers? Does the model have access to
| the original source of truth (like in RAG apps)?
|
| And in your examples what model do you actually use?
| maxrmk wrote:
| Great questions here!
|
| > How do you rate the correctness? Some complex LLM answers
| seemed to be correct but not in as much detail as the expected
| answer.
|
| We support two different modes: a strict pass/fail where an
| answer has to have all of the information we expect, and a
| rubric-based mode where answers are bucketed into things like
| "partially correct" or "wrong but helpful".
|
| To be honest, we also get the grading wrong sometimes. If you
| see anything egregious please email me the topic you used at
| max@talc.ai
|
| > How do you generate the answers? Does the model have access
| to the original source of truth (like in RAG apps)?
|
| You guessed right here - we connect to the knowledge base like
| a RAG app. We also use this to generate the questions -- think
| of it like reading questions out of a textbook to quiz someone.
|
| > And in your examples what model do you actually use?
|
| We use multiple models for the question generation, and are
| still evaluating what works best. For the demo, we are
| "quizzing" openai's 3.5 turbo model.
| tikkun wrote:
| I like the chevy tahoe callback - I'm assuming that's a reference
| to the chevy dealership that used an LLM and had people doing
| prompt tricks to get the chatbot to offer them a chevy tahoe for
| $1.
|
| The specificity in your writing above "to make this more
| concrete" about how it works was also helpful for understanding
| the product.
| maxrmk wrote:
| You're exactly right about the chevy tahoe reference. I wasn't
| sure if anyone would get it. I liked that post a lot, because
| as much as I think LLMs are going to be useful they have
| limitations that we haven't solved yet.
| logiduck wrote:
| For the chevy tahoe example, you are referencing the dealership,
| but in that case the failure wasn't the implementation missing a
| positive fact-extraction test - it was a failure of the
| guardrails.
|
| Aren't the guardrail tests much harder, since they are open-ended
| and have to guard against unknown prompt injections, while the
| fact tests are much simpler?
|
| I think a test suite that guards against the infinite surface
| area is more valuable than testing if a question matches a
| reference answer.
|
| Interested in how you view testing against giving a wrong answer
| outside of the predefined scope, as opposed to testing that all
| the test questions match a reference.
| maxrmk wrote:
| Totally - certain types of failures are much harder to test
| than others.
|
| We have a couple of different test generation strategies. As
| you can see in the demo and examples, the most basic one is
| "ask about a fact".
|
| Two of our other strategies are closer to what you're asking
| for:
|
| 1. Tests that try to deliberately induce hallucination by implying
| some fact that isn't in the knowledge base. For example, "do I
| need a pilot's license to activate the flight mode on the new
| chevy tahoe?" implies the existence of a feature that doesn't
| exist (yet). This was really hard to get right, and we have some
| coverage here but are still improving it.
|
| 2. Actively malicious interactions that try to override facts in
| the knowledge base. These are easy to generate.
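|
| As a very rough sketch, strategy 1 boils down to something like
| the snippet below (a simplified stand-in for the real prompt
| chain, which has a lot more checks around it):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def hallucination_probe(facts: list[str]) -> str:
|         # Ask for a question that presupposes a feature the
|         # knowledge base does NOT mention; the right behavior is
|         # for the app to say the feature doesn't exist.
|         prompt = (
|             "Here are the only facts the assistant knows:\n"
|             + "\n".join(f"- {f}" for f in facts)
|             + "\n\nWrite one customer question that casually "
|             "assumes a related feature these facts do not "
|             "mention."
|         )
|         resp = client.chat.completions.create(
|             model="gpt-4",
|             messages=[{"role": "user", "content": prompt}],
|         )
|         return resp.choices[0].message.content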
| logiduck wrote:
| Cool.
|
| Just as some feedback I did the demo with the "VW Beetle"
| topic and one of the test cases was:
|
| > Question: How did the introduction of the Volkswagen Golf
| impact the production and sales of the Beetle?
|
| > Expected: The introduction of the Volkswagen Golf, a front-
| wheel drive hatchback, marked a shift in consumer preference
| towards more modern car designs. The Golf eventually became
| Volkswagen's most successful model since the Beetle, leading
| to a decline in Beetle production and sales. Beetle
| production continued in smaller numbers at other German
| factories until it shifted to Brazil and Mexico, where low
| operating costs were more important.
|
| > GPT Response: The introduction of the Volkswagen Golf
| impacted the production and sales of the Beetle by gradually
| decreasing demand for the Beetle and shifting focus towards
| the Golf.
|
| It seems that the GPT response matches the expected answer but it
| was graded as incorrect. It seems to me the GPT answer is correct.
|
| In fact, a couple of the other answers were marked incorrectly:
|
| > Question: What was the Volkswagen Beetle's engine layout?
|
| > Expected Answer: Rear-engine, rear-wheel-drive layout
|
| > GPT Response: The Volkswagen Beetle had a rear-engine layout.
|
| This was also marked as incorrect.
| maxrmk wrote:
| Will take a look, thanks!
| logiduck wrote:
| Also, just a random thing I thought of while playing around with
| it: a few days ago a guy posted about an AI quiz generator for
| education.
|
| If you ever need to pivot, it seems like this would do
| reasonably well in the education space also.
| maxrmk wrote:
| Yeah, someone is going to build this. We considered
| quizzing the user on the topic instead of chatgpt for our
| demo. It's a lot of fun to test your knowledge on any
| topic, but it was a worse demo because it was way less
| related to our current product.
|
| I think that one of the obvious next big spaces for LLMs
| is education. I already find chatgpt useful when learning
| myself. That being said, I'm terrified of trying to sell
| things to schools.
| andy99 wrote:
| The first thing that popped into my head is what do you do with
| the test results? Specifically, how do they feed back into model
| improvement in a way that avoids overfitting? Do you think having
| some kind of classical "holdout" question set is enough?
| Especially with RAG, given the levers that are available (prompt,
| chunking strategy, ...), I'd wonder whether defining a bunch of
| test questions means you end up overfitting to them, or to the
| current data set. How can findings be extrapolated to new
| situations?
| maxrmk wrote:
| Ah! Think of this more like software testing that goes in CI/CD
| rather than an ML test or validation set. We're providing this
| testing for applications built on top of language models.
|
| For example, if you're a SWE working on Bing Chat, you can make
| a change to how retrieval works and quickly know how it
| affected accuracy on a range of different test scenarios. This
| kind of evaluation is done by contractors today, and they are
| slow and inaccurate.
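|
| Concretely, a team might wire this into CI as something like the
| test below. Everything here (module names, file name, threshold)
| is a made-up placeholder rather than our SDK, but it shows the
| shape of it:
|
|     # e.g. run under pytest on every pull request
|     import json
|
|     from my_rag_app import answer   # hypothetical app under test
|     from grading import grade       # e.g. a rubric-based grader
|
|     def test_no_accuracy_regression():
|         with open("talc_testset.json") as f:
|             cases = json.load(f)
|         correct = sum(
|             grade(c["question"], c["reference_answer"],
|                   answer(c["question"])) == "correct"
|             for c in cases
|         )
|         assert correct / len(cases) >= 0.90, \
|             "accuracy regressed below 90%"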
| dkindler wrote:
| Here's an example where the GPT response was correct, but was
| marked as incorrect: https://ibb.co/tMGxcf3
| maxrmk wrote:
| Thanks for flagging!
| typpo wrote:
| Congrats on the launch!
|
| I've been interested in automatic testset generation because I
| find that the chore of writing tests is one of the reasons people
| shy away from evals. Recently landed eval testset generation for
| promptfoo (https://github.com/typpo/promptfoo), but it is non-RAG
| so simpler than your implementation.
|
| Was also eyeballing this paper https://arxiv.org/abs/2401.03038,
| which outlines a method for generating asserts from prompt
| version history that may also be useful for these eval tools.
| maxrmk wrote:
| Thanks! I've been following promptfoo, so I'm glad to see you
| here. In addition to automatic evals I think every engineer and
| PM using LLMs should be looking at as many real responses as
| they can _every day_, and promptfoo is a great way to do that.
| moinism wrote:
| Congrats on the launch! Just tried the demo and it looks
| impressive. Good luck.
|
| Are you by any chance hiring global-remote, full-stack/front-end
| devs? Would love to work with you guys.
| maxrmk wrote:
| Thanks! We aren't hiring right now, but if you shoot me an
| email at max@talc.ai I'll follow up in a few months.
| quadcore wrote:
| Now someone has to test talc AI. I can do it.
|
| Impressive demo and business idea, congrats, good luck!
| maxrmk wrote:
| who tests the testing tool?
|
| Thanks though -- and let us know if you hit any issues while
| playing around with the demo!
| pchunduri6 wrote:
| I just tried the demo, and it looks great! Congrats on the
| launch!
|
| I have a couple of questions:
|
| 1) How often do you find that the LLM fails to generate the
| correct question-answer pairs? The biggest challenge I'm facing
| with LLM-based evaluation is the variability in LLM performance.
| I've found that the same prompt results in different LLM
| responses over multiple runs. Do you have any insights on this
| issue and how to address it?
|
| 2) Sometimes, the domain expert generating the test set might not
| be well-equipped to grade the answers. Consider a customer-facing
| chatbot application. The RAG app might be focused on very
| specific user information that might be hard for the test set
| creator to verify or attest to. Do you think there are ways to
| make this grading process easier?
| matt_lee wrote:
| Thanks! Max's cofounder chiming in here.
|
| 1) There's an interesting subtlety in the phrase "the correct
| question-answer pairs". While we don't often find factually
| incorrect pairs because of how we're running the pipeline, the
| bigger question is whether or not the pairs we generated are
| "the" correct ones -- if they are relevant and helpful. This
| takes some manual tweaking at the moment.
|
| Inconsistent outputs over different runs are definitely an
| issue, but most teams we've worked with barely even have the
| CI/CD practice to be able to measure that rigorously. As we
| mature we'll aim to tackle flakiness of tests (and models) over
| time, but a bigger challenge has been getting regular tests
| like these set up in the first place.
|
| 2) In this scenario, we go to the documents powering a RAG
| application to both generate and grade answers. For example,
| the knowledge base might know that (1) product A is being
| recalled, and (2) customer #4 is asking for a warranty claim on
| product A. Using those two bits of information, we might
| generate a scenario that tests whether or not customer #4 gets
| the claim fulfilled. In other words, specific user information
| is simulated/used during the test set creation.
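|
| As a toy sketch, that scenario generation step looks roughly like
| this (the fact retrieval and output format are simplified, and
| the function name is just for illustration):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     def scenario_from_facts(fact_a: str, fact_b: str) -> str:
|         # e.g. fact_a = "product A is being recalled",
|         #      fact_b = "customer #4 has a warranty claim on A"
|         prompt = (
|             f"Fact 1: {fact_a}\nFact 2: {fact_b}\n"
|             "Write a short support-chat scenario (the customer's "
|             "message plus the expected correct resolution) that "
|             "can only be handled correctly by combining both "
|             "facts."
|         )
|         resp = client.chat.completions.create(
|             model="gpt-4",
|             messages=[{"role": "user", "content": prompt}],
|         )
|         return resp.choices[0].message.content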
| julesvr wrote:
| Congrats on the launch!
|
| On your pricing model, as it's usage based, don't you incentivize
| your customers to use your product as little as possible?
| Wouldn't it be better to have limited tiers with fixed
| annual/monthly recurring rates? Also, do you sell to enterprise?
| I assume they would like that setup even more, as the rates are
| predefined and they have a budget they have to work within.
|
| I'm currently developing my own pricing model and these are some
| issues I'm struggling with, so curious what you think.
| matt_lee wrote:
| Not a pricing expert by any means, but here were some of the
| considerations we thought of when we made that decision:
|
| 1. There are plenty of usage-based infrastructure/dev tools (e.g.
| AWS, Databricks), so I don't think we incentivize minimal usage.
|
| 2. The value we're providing feels directly tied to how much
| testing we're running, so when we tried to construct tiers they
| didn't feel helpful.
|
| 3. In our experience, enterprises have been fine with usage
| based pricing. They're already paying for human QA / labelling
| on usage based terms (even if it's part of a larger fixed
| contract), so our pricing isn't a deviation for them.
|
| Open to thoughts if anyone has them!
| Robotenomics wrote:
| Very, very impressive. I ran a couple of tests, and on the complex
| questions it received 80%, although I would say the grading was
| harsh, as the answer could be said to be correct. I also found the
| questions generated rather simple, not complex.
|
| On the second test it was 100% incorrect for the complex
| questions! However, when I checked the generated questions
| directly with GPT-4, it answered 100% correct. Could that be due
| to my custom settings in GPT-4? Will run it with university
| students. Fascinating work.
| matt_lee wrote:
| Thanks for giving it a whirl!
|
| I agree that the current grading is a bit harsh -- the rubric
| we're using in this demo is fairly rudimentary. What we've seen
| be more helpful is a range of grades along the lines of correct
| / correct but unhelpful / correct but incomplete / incorrect.
| This somewhat depends on individual use cases though.
|
| Let me know which generated questions you thought could be more
| complex! We're always working on improving our ability to
| explore the knowledge space for challenging questions.
| koeng wrote:
| I love your demo for this. It's one of the best demos I've ever
| come across in a launch HN. Very easy to understand and use. It
| seems to suffer with more complex questions though. For example:
|
| Question: Why does the pUC19 plasmid have a high copy number in
| bacterial cells?
|
| Expected answer: The pUC19 plasmid has a high copy number due to
| the lack of the rop gene and a single point mutation in the
| origin of replication (ori) derived from the plasmid pMB1.
|
| GPT response: The pUC19 plasmid has a high copy number in
| bacterial cells due to the presence of the pUC origin of
| replication, which allows for efficient and rapid replication of
| the plasmid.
|
| Both are technically correct - the expected answer is simply more
| detailed about the pUC origin, but both would be considered
| correct. It seems difficult to test things like this, but maybe
| that's just not possible to really get correct.
|
| I wonder how well things like FutureHouse's wikicrow will work
| for summarizing knowledge better -
| https://www.futurehouse.org/wikicrow - and how that could be
| benchmarked against Talc
| matt_lee wrote:
| Thank you for the kind words!
|
| One of my regrets about the demo is that we paid a lot of
| attention to showing off our ability to generate high quality
| Q/A pairs, but not nearly as much to showing what a thoughtful
| and thorough grading rubric can do.
|
| It's totally possible to do high-quality grading given a rubric
| that sets expectations! Great implementations we've seen use
| categories like correct / correct but incomplete / correct but
| unhelpful / incorrect to better label the situation you describe.
| We've found that we can grade with much more nuance given a good
| rubric and categories, but unfortunately we didn't focus on that
| side of things in the demo.
|
| I'm not familiar with wikicrow, will check it out!
| bestai wrote:
| I think allowing other languages besides English could be a good
| idea.
| maxrmk wrote:
| Good idea! There's no limitation in the generation or grading,
| but we didn't set up the search to support this. I'll see if
| it's possible to enable this in the Wikipedia search component.
| Imnimo wrote:
| I tried the demo with Cal Ripken Jr. I was surprised by some of
| the complex questions:
|
| >Which MLB player won the Sporting News MLB Rookie of the Year
| Award as a pitcher in 1980, and who did Cal Ripken Jr. surpass to
| hold the record for most home runs hit as a shortstop?
|
| >What team did Britt Burns play for in the minor leagues before
| making his MLB debut, and in what year did Cal Ripken Jr. break
| the consecutive games played record?
|
| >Who was the minor league pitching coordinator for the Houston
| Astros until 2010, and what significant baseball record did Cal
| Ripken Jr. break in 1995?
|
| All five questions are a combination of a question about a Britt
| Burns fact and an unrelated Cal Ripken fact.
|
| Why is this? Britt Burns doesn't seem to appear on the live
| Wikipedia page for Ripken. Does he appear on a cached version? Or
| is it forming complex questions by finding another page in the
| same category as Ripken and pulling more facts?
| maxrmk wrote:
| I was worried people would run into this quirk in the demo. We
| have several 'advanced' question generation strategies. You
| correctly guessed the one we're using in the demo: forming
| complex questions by finding another page in the same category
| as Ripken and pulling more facts.
|
| Normally we pull a ton of related topics and try to pick the
| best, but to keep the generation fast and cost-effective in the
| demo I limited the number of related pages we pull. So sometimes
| (like in this case) you get something barely related and end up
| with odd, disjointed questions.
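|
| For the curious, the demo's related-page lookup is roughly the
| sketch below (straight against the MediaWiki API; production
| pulls far more candidates and ranks them before generating
| questions):
|
|     import requests
|
|     API = "https://en.wikipedia.org/w/api.php"
|
|     def related_pages(title: str, limit: int = 3) -> list[str]:
|         # Grab the page's first visible category...
|         cats = requests.get(API, params={
|             "action": "query", "prop": "categories",
|             "titles": title, "clshow": "!hidden",
|             "format": "json",
|         }).json()
|         page = next(iter(cats["query"]["pages"].values()))
|         category = page["categories"][0]["title"]
|         # ...then pull a few other members of that category; their
|         # facts get combined with the original page's facts.
|         members = requests.get(API, params={
|             "action": "query", "list": "categorymembers",
|             "cmtitle": category, "cmlimit": limit + 1,
|             "format": "json",
|         }).json()["query"]["categorymembers"]
|         return [m["title"] for m in members
|                 if m["title"] != title][:limit]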
| Imnimo wrote:
| Ah, that does make sense - especially for a category like a
| baseball player where it may take a lot of other pages to
| find one that's truly related. Would be expensive for a demo,
| but not a big deal for a real evaluation.
| maxrmk wrote:
| Yep, in real use cases the latency for generating questions
| doesn't really matter. But in the demo I was really worried
| about it.
| bestai wrote:
| I think for your idea to have traction, (1) the questions should
| be selected by their importance and (2) the questions should be
| chained to allow new results. Just for inspiration or example,
| you could create a quiz for solving a puzzle and, at the same
| time, solve the puzzle by answering the questions. The big idea
| is using your tool to enhance step-by-step reasoning in LLMs.
|
| I think you could use a text area for the user to indicate if the
| quiz is about getting the main idea or if it is about testing the
| details.
|
| And for big clients, the system could be tailored so that the
| questions and structure reflect user intentions.
| tommykins wrote:
| As someone who uses Machine Learning to predict the presence of
| Talc I approve of this, even if I have no use case for it
| whatsoever.
___________________________________________________________________
(page generated 2024-01-18 23:00 UTC)