[HN Gopher] Launch HN: Traceloop (YC W23) - Detecting LLM Halluc...
___________________________________________________________________
Launch HN: Traceloop (YC W23) - Detecting LLM Hallucinations with
OpenTelemetry
Hey everyone, we are Nir and Gal from Traceloop
(https://www.traceloop.com). We help teams understand when their
LLM apps are failing or hallucinating at scale. See a demo:
https://www.traceloop.com/video or try it yourself at
https://www.traceloop.com/docs/demo.

When you move your LLM app to production, the sheer scale makes it
harder for engineers and data scientists alike to understand when
the LLM is hallucinating or returning malformed responses. Once
you're making millions of calls to OpenAI a month, methods like
"LLM as a judge" can't run at a reasonable cost or latency. So what
most people we talked to do is sample some generations by hand,
maybe for a few specific important customers, and manually look for
errors or hallucinations.

Traceloop is a monitoring platform that detects when your LLM app
fails. Under the hood, we built real-time versions of known metrics
like faithfulness, relevancy, redundancy, and many others. These
are loosely based on well-known NLP metrics that work well for
LLM-generated text. We correlate them with changes we detect in
your system - like updates to prompts or to the model you're using
- to detect regressions automatically.

Here are some examples we've seen with our customers:

1. Applying our QA relevancy metric to an entity extraction task,
we discovered cases where the model was not extracting the right
entities (an address instead of a person's name, for example), or
was returning random answers like "I'm here! What can I help you
with today?".

2. Our soft-faithfulness metric detected cases in summarization
tasks where a model was making up content that never appeared in
the original text.

One of the challenges we faced was figuring out how to collect the
data we need from our customers' LLM apps. That's where
OpenTelemetry came in handy. We built OpenLLMetry
(https://github.com/traceloop/openllmetry) and announced it here
almost a year ago. It standardized the use of OpenTelemetry to
observe LLM apps: the concepts of traces, spans, metrics, and logs
that OpenTelemetry standardized extend naturally to gen AI. We
partnered with 20+ observability platforms to make sure that
OpenLLMetry becomes the standard for GenAI observability and that
the data we collect can be sent to other platforms as well.

We plan to extend the metrics we provide to support agents that use
tools, vision models, and other developments in our fast-paced
industry. We invite you to give Traceloop a spin and are eager for
your feedback! How do you track and debug hallucinations? How much
of an issue has that been for you? What types of hallucinations
have you encountered?
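
To make the OpenTelemetry part concrete, here is a minimal sketch in
plain OpenTelemetry Python of what recording an LLM call as a span
looks like. OpenLLMetry automates this kind of instrumentation for
you; the span name, the summarize function, and the gen_ai.*
attributes below are illustrative (they follow the draft GenAI
semantic conventions), not an exact schema:

```python
# Sketch only: manually recording an LLM call as an OpenTelemetry span.
# OpenLLMetry produces spans like this automatically; the gen_ai.*
# attribute names are illustrative (check the GenAI semantic
# conventions for the current names).
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-app")  # hypothetical instrumentation name

def summarize(text: str) -> str:
    with tracer.start_as_current_span("openai.chat") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", "gpt-4o")
        completion = "..."  # placeholder: call your LLM here
        # rough word counts as stand-ins for real token counts
        span.set_attribute("gen_ai.usage.input_tokens", len(text.split()))
        span.set_attribute("gen_ai.usage.output_tokens", len(completion.split()))
        return completion
```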
Author : GalKlm
Score : 61 points
Date : 2024-07-17 13:19 UTC (9 hours ago)
| Oras wrote:
| Congratulations on the launch.
|
| This is a crowded market, and there are many tools doing the same
| thing.
|
| How are you differentiating yourself from other tools like:
|
| Langfuse, Portkey, Keywords AI, Promptfoo
| nirga wrote:
| Thanks!
|
| We differentiate in 2 ways:
|
| 1. We focus on real-time monitoring. This is where we see the
| biggest pain with our customers, so we spent a lot of time
| researching and building the right metrics that can run at
| scale, fast and at low cost (and you can try them all in our
| platform).
|
| 2. OpenTelemetry - we think this is the best way to observe LLM
| apps. It gives you a better understanding of how other parts of
| the system interact with your LLM. Say you're calling a
| vector DB, or making an HTTP call - you get them all on the
| same trace. It's also better for the customers - they're not
| vendor-locked to us and can easily switch to another platform
| (or even use them in parallel).
| swyx wrote:
| not to mention langsmith? braintrust? humanloop? does that
| count? not sure what else - let's crowdsource a list here so
| that people can find them in the future
| nirga wrote:
| I have it internally, I can share it if you want!
|
| But to the point of comparison between these and tools like
| Traceloop - it's interesting to see this space and how each
| platform takes its own path and finds its own use cases.
|
| LangSmith works well within the LangChain ecosystem, together
| with LangGraph and LangServe. But if you're using LlamaIndex,
| or even just vanilla OpenAI, you'll spend hours setting up
| your observability.
|
| Braintrust and Humanloop (and, to some extent, other tools I
| saw in this area) take the path of a "full development
| platform for LLMs".
|
| We try to look at it the way developers look at tools like
| Sentry. Continue working in your own IDE with your own tools
| (wanna manage your prompts in a DB or in git? Wanna use LLMs
| your own way with no frameworks? No problem). We install in
| your app with one line, work around your existing code base,
| and make monitoring, evaluation, and tracing work.
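|
| To illustrate the one-line part, a simplified sketch (the
| app_name value is illustrative; exact options are in our docs):
|
|     from traceloop.sdk import Traceloop
|
|     Traceloop.init(app_name="my-llm-app")  # app_name is illustrative
|
| From there the SDK instruments your LLM and vector DB calls and
| emits standard OpenTelemetry traces.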
| Lienetic wrote:
| I'd love to see that list!
| nirga wrote:
| Ping me over Slack (traceloop.com/slack) or email nir at
| traceloop dot com.
| R21M1214 wrote:
| I'm not sure which ones are Otel compliant. I'm only aware of 3
| that are:
|
| 1. Traceloop Otel 2. Langtrace.ai Otel 3. OpenLIT Otel 4.
| Portkey 5. Langfuse 6. Arize LLM 7. Phoenix SDK 8. Truera LLM
| 9. Truelens 10. Context 11. Braintrust 12. Parea 13. Context
| AI 14. openlayer.com 15. Deepchecks 16. langsmith 17.
| Confident AI 18. Helicone 19. Langwatch.ai 20. Arthur 21.
| Aporia 22. scale.com 23. Whylabs 24. gentrace.ai 25.
| humanloop.com 26. fixpoint.co 27. W n B Traces 28. Langtail
| 29. Fiddler 30. Evidently Ai 31. Superwise 32. Exxa 33.
| Honeyhive 34. Flowstack 35. Log10 36. Giskard 37. Raga AI 38.
| AgentOps 39. Patronus AI 40. Mona 41. Bricks Ai 42. Sentify
| 43. LogSpend 44. Nebuly 45. Autoblocks 46. Radar / Langcheck
| 47. Dokulabs 48. Missing studio 49. Lunary.ai 50. Censius.ai
| 51. ML flow 52. Galileo 53. trubrics 54. Prompt Layer 55.
| Athina 56. getnomos.com 57. c3.ai 58. baselime.io 59.
| Honeycomb llm
| swyx wrote:
| dear god, where is this list from? surely not hand curated?
| mloncode wrote:
| 60. Radiant.AI 61. Weights & Biases (Weave) 62. Quotient
| AI (some observability there)
| resiros wrote:
| Just wanted to say great work on standardizing otel for LLM
| applications (https://github.com/open-telemetry/semantic-
| conventions/tree/...) and opensourcing OpenLLMetry. We're also
| building in this space, focusing more on eval (agenta). I think
| using otel would make the whole space move much faster.
| nirga wrote:
| Thanks so much! I always say that I'm a strong believer in open
| protocols, so I'd love to assist you if you want to use
| OpenLLMetry as your SDK. We've onboarded other startups /
| competitors like Helicone and HoneyHive, and it's been
| tremendously successful (hopefully that's what they'll tell you
| as well).
| mshcodez wrote:
| HoneyHive founder here.
|
| Nir and team have built an amazing OSS package and have been
| fantastic to collaborate with (despite being competitors)! As
| an industry, I think more of us need to work together to
| standardize telemetry protocols, schemas, naming conventions,
| etc. since it's currently all over the place and leads to a
| ton of confusion and headache for developers (which
| ultimately goes against the whole point of using devtools in
| the first place).
|
| We recently integrated OpenLLMetry into our SDKs with the
| sole purpose of offering standardization and interoperability
| with customers' existing DevSecOps stacks. Customers have
| been loving it so far!
| threeseed wrote:
| Your startup is as deceptive as Traceloop.
|
| You make claims like "detect LLM errors like hallucination"
| even though you have no guaranteed ability to do this.
|
| At best you can assist in detection.
|
| As someone who works at a large enterprise deploying LLMs I
| can tell you many people are getting pretty tired of the
| false claims.
| nirga wrote:
| I replied to you in a different thread; I don't think
| calling our companies "deceptive" will help either of us get
| anywhere. While I agree with you that detection will never
| be airtight, I don't think that's the goal. By design you'll
| have hallucinations, and the question should be how you
| monitor the rate and look for changes and anomalies.
| lukan wrote:
| No idea how honest this is (I might have gotten a bit
| cynical) - but reading this, it sounds like you guys have a
| really healthy, constructive competition with elements of
| cooperation! Love to see that.
| swyx wrote:
| congrats on launch!
|
| the thing about OTel is that it is by nature vendor agnostic. so
| if i use OpenLLMetry, i should be able to pipe my otel traces to
| whatever existing o11y tool I use, right? what is the benefit of a
| dedicated monitoring platform?
|
| (not cynical, just inviting you to explain more)
| resiros wrote:
| Not the OP here (but building in the same space). The reason
| you instrument LLM data is usually to improve the quality and
| speed of your application. The tools that extract the insights
| to enable that, and the integration with your LLM
| experimentation workflow, are what differentiate a general
| observability solution from an LLM-specific one.
| swyx wrote:
| oh cool. do you also consume OTel? or something else?
| resiros wrote:
| Right now we have our own instrumentation but we're working
| towards Otel compatibility.
| nirga wrote:
| Great question, and I see you already got a similar answer, but
| I'll add some of my thoughts. We are actively promoting
| OpenLLMetry as a vendor-agnostic way of observing LLMs (see
| some examples [1], [2]). We believe that people may start with
| whatever vendor they work with today and gradually shift to, or
| use, something like Traceloop because of specific features we
| have - for example, the ability to take the raw data that we
| output with OpenLLMetry and add another layer of "smart
| metrics" (like QA relevancy, faithfulness, etc.) that we
| calculate on our backend / pipelines, or better tooling around
| observability of LLM calls, agents, etc.
|
| [1] https://docs.newrelic.com/docs/opentelemetry/get-
| started/tra...
|
| [2] https://docs.dynatrace.com/docs/observe-and-
| explore/dynatrac...
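|
| As a rough sketch of the vendor-agnostic part (the endpoint and
| header values here are placeholders), the same spans can be
| shipped to any OTLP-compatible backend with the standard
| exporter:
|
|     from opentelemetry import trace
|     from opentelemetry.sdk.trace import TracerProvider
|     from opentelemetry.sdk.trace.export import BatchSpanProcessor
|     from opentelemetry.exporter.otlp.proto.http.trace_exporter import (
|         OTLPSpanExporter,
|     )
|
|     exporter = OTLPSpanExporter(
|         endpoint="https://collector.example.com/v1/traces",  # placeholder
|         headers={"authorization": "Bearer <api-key>"},  # placeholder
|     )
|     provider = TracerProvider()
|     provider.add_span_processor(BatchSpanProcessor(exporter))
|     trace.set_tracer_provider(provider)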
| BeautifulOrb wrote:
| there's a well known artist named traceloops who has a
| prolific/longstanding body of work. why did you choose this name?
| nirga wrote:
| I know! When we started, every time I googled "traceloop",
| this was the first result.
|
| 2 reasons why we chose it (in this order):
|
| 1. traceloop.com was available
|
| 2. we work with traces
| Hansenq wrote:
| an available .com is basically the only reason you need:
| https://paulgraham.com/name.html
| aaronvg wrote:
| I doubt anyone would be confused with Traceloops the artist vs
| Traceloop the LLM Observability Platform
| Lienetic wrote:
| Where can I learn in more detail about the metrics you support
| and how they work?
|
| I tried multiple other solutions but kept running into the
| problem that occasionally the framework would give me some
| score/evaluation of an LLM response that didn't make any sense,
| and there was minimal information about how it came up with the
| score. Often, I'd end up digging into the implementation of the
| framework to find the underlying evaluation prompt or classifier
| only to realize that the metric name is confusing or results are
| low confidence. I'm more cautious about using these tools now and
| look more deeply at how they work so that I can assess grading
| quality before relying on them to identify problematic outputs
| (e.g. hallucinations).
| resiros wrote:
| I think the issue is that many of these metrics (e.g. RAGAS)
| are LLM-as-a-judge metrics. These are very far from reliable,
| and making them reliable is still a research problem. I've seen
| a couple of startups training their own LLM judge models to
| solve this. There is also some work attempting to improve
| reliability through sampling, such as G-eval
| (https://github.com/nlpyang/geval).
|
| One needs to think of these metrics as a way to filter all the
| data to find potential issues, not as a final evaluation
| criterion. The gold standard should be human evaluators.
| Lienetic wrote:
| Are there any approaches today that you've found are at least
| mostly reliable? Bonus points if it is somewhat
| clear/easy/predictable to know when it isn't or won't be.
|
| We use human evaluation but that is naturally far from
| scalable, which has especially been a problem when working on
| more complicated workflows/chains where changes can have a
| cascading effect. I've been encouraging a lot of dev
| experimentation on my team but would like to get a more
| consistent eval approach so we can evaluate and discuss
| changes with more grounded results. If all of these metrics
| are low confidence, they become counterproductive since
| people easily fall into the trap of optimizing the metric.
| nirga wrote:
| I tend to find classic NLP metrics more predictable and
| stable than "LLM as a judge" metrics, so I'd see if you can
| rely on them more.
|
| We've written a couple of blog posts about some of them:
| https://www.traceloop.com/blog
| swyx wrote:
| for your blog can i offer a big downvote for the massive
| ai generated cover image thing? it's a trend for normies
| but for developers it's absolutely meaningless. give us
| info density pls
| nirga wrote:
| roger that! I like them though (am I a normie then?)
| nirga wrote:
| We trained our own models for some of them, and we combined
| some well-known NLP metrics (like Gruen [1]) to make this work.
|
| You're right that it's hard to figure out how to "trust" these
| metrics. But you shouldn't look at them as a way to get an
| objective number about your app's performance. They're more of
| a way to detect deltas - regressions or changes in performance.
| When you see more alerts or more negative results, you know
| you've regressed; when you see fewer, you can tell you're
| improving. And this works for tools like RAGAS as well as our
| own metrics in my view.
|
| [1] https://www.traceloop.com/blog/gruens-outstanding-
| performanc...
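|
| To make the "deltas, not absolute scores" idea concrete, here's
| a toy sketch (this is NOT our actual metric - the overlap score
| and the threshold are purely illustrative):
|
|     from collections import deque
|
|     def overlap_faithfulness(source: str, summary: str) -> float:
|         # Crude proxy: fraction of summary words that appear in
|         # the source text. Only useful for watching drift over
|         # time, not as an absolute quality score.
|         src = set(source.lower().split())
|         out = summary.lower().split()
|         return sum(w in src for w in out) / len(out) if out else 0.0
|
|     window = deque(maxlen=500)  # rolling window of recent scores
|
|     def should_alert(score: float, baseline: float = 0.6) -> bool:
|         # Alert when the rolling average drops below a baseline,
|         # e.g. right after a prompt or model change.
|         window.append(score)
|         return sum(window) / len(window) < baseline
|
| The point is the trend: more alerts after a prompt or model
| change means a regression, fewer means an improvement.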
| bionhoward wrote:
| Check out these Wikipedia articles:
|
| Confabulation https://en.m.wikipedia.org/wiki/Confabulation
|
| Hallucination https://en.m.wikipedia.org/wiki/Hallucination
|
| What drove the AI industry to blow off accepted naming from
| psychopathology and use the word for PERCEPTUAL errors to refer
| to LANGUAGE OUTPUT errors?
|
| If AI people already use the term "hallucination" to label
| confabulations, then when an AI actually hallucinates (a
| perceptual error), what's the word for that?
|
| How will we avoid serious errors in understanding if
| hallucination in AI means confabulation in humans and $NEW_TERM
| in AI means hallucination in humans?
|
| Just seems harmful to gloss over this humongous vocabulary error.
|
| How can we claim to respect the difficulty of naming things if we
| all select the wrong answer to a basic undergrad psychology
| multiple choice question with only two options?
|
| It feels like painting ourselves into a corner which will
| inevitably make computer scientists look dumb. Who here wants to
| look dumb for no reason?
|
| I don't want to be negative, but is using the blatantly wrong
| word for confabulation a good idea in the long term?
| cmcconomy wrote:
| if i may theorize: one of these two terms is generally
| recognised by the broader english speaking community
| kcorbitt wrote:
| Big congrats on the official launch!
|
| Slightly tooting my own horn here, but at OpenPipe we've got a
| collaboration set up with Traceloop. That means you can record
| your production traces in Traceloop then export them to OpenPipe
| where you can filter/enrich them and use them to fine-tune a
| super strong model. :)
| remram wrote:
| Acknowledging that AI is unreliable, the solution is to layer
| another AI to hopefully let you know about it. Of course,
| brilliant, why did I expect anything different from the AI
| industry.
| xyst wrote:
| ?? but who is monitoring the AI layer monitoring the AI who
| produced the original output ??
|
| openai audited by claudeai which is then audited by gemini
| ai...
|
| then to close the loop, gemini ai is then audited by openai
| verdverm wrote:
| people are lazy, we're more than happy to not be in the loop
| its_ethan wrote:
| I had read the OP's comment as sarcastic, but you never know
| these days lol
|
| Your concern would be exactly mine as well, and why I assumed
| "brilliant" was sarcasm, cause it _feels like_ handing over
| the problem to the same solution that got you the problem in
| the first place?
| nirga wrote:
| That's the same logic as saying you don't want to use a
| computer to monitor or test your code, since that would mean
| a computer monitoring a computer. AI is a broad term - I
| agree you can use GPT (or any LLM) to grade an LLM
| accurately, but that's not the only way you can monitor.
| its_ethan wrote:
| > computer to monitor or test your code since it will
| mean that a computer will monitor a computer
|
| I mean... you don't trust the computer in that case, you
| trust the _person_ who wrote the test code. Computers do
| what they're told to do, so there's no trust required of
| the computer itself. If you swap out the person (that
| you're trusting) writing that code with an AI writing
| that test code, then it's closer to your analogy - and in
| that case, I (and the guy above me, it seems) wouldn't
| trust it for anything impactful.
|
| Even if you're not using an LLM specifically (which no
| one in this chain even said you were), an AI built off
| some training set to eliminate hallucinations is still
| just an AI. So you're still using an AI to keep an AI in
| check, which begs the question (posed above) of: what
| keeps your AI in check?
|
| Poking fun at a chain of AI's all keeping each other in
| check isn't really a dig at you or your company. It's
| more of a comment on the current industry moment.
|
| Best of luck to you in your endeavor anyway, by the way!
| nirga wrote:
| Thanks! I wasn't offended or anything, don't get the
| wrong impression.
|
| What strikes me as odd is the idea that an AI checking an AI
| is inherently an issue. AI can mean a lot of things - an
| encoder architecture, a neural network, or a simple
| regression function. And at the end of the day, similar to
| what you said, there was a human building and fine-tuning
| that AI.
|
| Anyway, this feels more like a philosophical question than
| an engineering one.
| nirga wrote:
| I'm sorry but this is not what we do. We don't use LLMs to
| grade your LLM calls.
| threeseed wrote:
| > Know when your LLM app is hallucinating or malfunctioning
|
| It astonishes me that you are willing to make so many deceptive
| claims on your website like this.
|
| You have no ability to detect with any certainty hallucinations.
| No one in the industry does.
| xyst wrote:
| clearly LLM app has added such logic to their app:
|
| ``` if (query.IsHallucinated()) { notifyHumanOfHallucination();
| } ```
|
| this one line will get them that unicorn eval
| nirga wrote:
| I think that LLMs hallucinate by design. I'm not sure we'll
| ever get to 0% hallucinations, and we should be OK with that
| (at least for the coming years). So getting an alert on a
| single hallucination becomes less interesting. What's more
| interesting is knowing the rate at which this happens, and
| keeping track of whether that rate increases or decreases
| over time or with changes to models.
| nirga wrote:
| I think it depends on the use case and how you define
| hallucinations. We've seen our metrics perform well (i.e.
| they correlate with human feedback) for use cases like
| summarization, RAG question-answering pipelines, and entity
| extraction.
|
| At the end of the day, things like "answer relevancy" are
| pretty binary, in the sense that for a human evaluator it
| will be pretty clear whether an answer addresses a question
| or not.
|
| I wonder if you can elaborate on why you claim that there's
| no ability to detect hallucinations with any certainty.
| phillipcarter wrote:
| Congrats on the official launch, Nir and Gal! Deeply appreciate
| your contributions to OTel as well.
| nonameiguess wrote:
| This is poorly worded. Detecting "hallucinations" as the term is
| commonly used, as in a model making up answers not actually in
| its source text, or answers that are generally untrue, is
| fundamentally impossible. Verifying the truth of a statement
| requires empirical investigation. It isn't a feature of language
| itself. This is just the basic analytic/synthetic distinction
| identified by Kant centuries ago. It's why we have science in the
| first place and don't generate new knowledge by reading and
| learning to make convincing sounding arguments.
|
| Your far more scaled-down claim, however - that you can detect
| answers that don't address a prompt at all, or summaries that
| assert things not actually in the original text - is definitely
| doable, but it raises a maybe naive or stupid
| question. If you can do this, why not sell an LLM that simply
| doesn't do these stupid things in the first place? Or why do the
| people currently selling LLMs not just automatically detect
| obvious errors and not make them? Doesn't your business as
| constituted depend upon LLM vendors never figuring out how to do
| this themselves?
___________________________________________________________________
(page generated 2024-07-17 23:04 UTC)