[HN Gopher] Show HN: Gentrace - connect to your LLM app code and...
___________________________________________________________________
Show HN: Gentrace - connect to your LLM app code and run/eval it
from a UI
Hey HN - Doug from Gentrace here. We originally launched via Show HN
in August of 2023 as evaluation and observability for generative AI:
https://news.ycombinator.com/item?id=37238648

Since then, everyone from the model providers to LLM ops companies
has built a prompt playground. We had one too, until we realized this
was totally the wrong approach:

- It's not connected to your application code
- They don't support all models
- You have to rebuild evals for just this one prompt (can't use your
  end-to-end evals)

In other words, it was a ton of work and time to use these to
actually make your app better.

So, we built a new experience and are relaunching around this idea:
Gentrace is a collaborative LLM app testing and experimentation
platform that brings together engineers, PMs, subject matter experts,
and more to run and test your actual end-to-end app.

To do this, use our SDK to:

- connect your app to Gentrace as a live runner over websocket
  (local) / via webhook (staging, prod)
- wrap your parameters (e.g. prompt, model, top-k) so they become
  tunable knobs in the front end (rough sketch at the end of this
  post)
- edit the parameters and then run / evaluate the actual app code
  with datasets and evals in Gentrace

We think it's great for tuning retrieval systems, upgrading models,
and iterating on prompts.

It's free to trial. Would love to hear your feedback / what you
think!
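To make the parameter-wrapping step concrete, here is a minimal
hypothetical sketch in TypeScript. The names below (Params,
answerQuestion, runWithOverrides, and the stubbed retrieve/complete
calls) are illustrative only and are not the actual Gentrace SDK API;
they just show how hard-coded values become typed knobs that a
UI-driven run can override.

    // Hypothetical sketch, not the Gentrace SDK: hard-coded values
    // become a typed parameter object that a test run can override.
    type Params = {
      systemPrompt: string;
      model: string;
      topK: number;
    };

    // Defaults used when the app runs normally (no overrides).
    const defaults: Params = {
      systemPrompt: "You are a helpful support agent.",
      model: "gpt-4o-mini",
      topK: 5,
    };

    // The end-to-end pipeline takes the full parameter set, so a
    // test run can swap any knob without touching the code path.
    async function answerQuestion(q: string, p: Params) {
      const docs = await retrieve(q, p.topK);            // retrieval
      return complete(p.model, p.systemPrompt, docs, q); // generation
    }

    // Runner entry point: a UI could send overrides (websocket
    // locally, webhook in staging/prod), merge them into the
    // defaults, and execute the real app code against a dataset row.
    async function runWithOverrides(q: string, overrides: Partial<Params>) {
      return answerQuestion(q, { ...defaults, ...overrides });
    }

    // Stubs standing in for real retrieval and model calls.
    async function retrieve(q: string, k: number): Promise<string[]> {
      return [`doc about "${q}"`].slice(0, k);
    }
    async function complete(m: string, sys: string, docs: string[], q: string) {
      return `[${m}] (${sys}) answer to "${q}" using ${docs.length} docs`;
    }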
Author : dsaffy
Score : 18 points
Date : 2024-12-10 20:35 UTC (2 days ago)
(HTM) web link (gentrace.ai)
(TXT) w3m dump (gentrace.ai)
| zwaps wrote:
| Why is it that all of the many eval and LLM ops offerings spent
| seemingly all their energy on UI and playgrounds?
|
| When it comes to tracking, tracing and versioning the entire LLM
| call chain - from prompt to response, models, model and workflow
| code, and code gen/exec artifacts - it's just not there. A basic
| solution based on OpenTelemetry for some subset of an LLM app is
| easy to do, heck even I have written one. But what use is that?
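|
| (As a concrete example of that "basic solution": a minimal sketch
| using the real OpenTelemetry JS API, with made-up span and
| attribute names and a stubbed model client. It records the prompt,
| settings and output of one call - and nothing about the
| orchestration around it.)
|
|     import { trace } from "@opentelemetry/api";
|
|     const tracer = trace.getTracer("llm-app");
|
|     // Stub standing in for a real model client.
|     async function callModel(model: string, prompt: string) {
|       return `(${model}) response to: ${prompt}`;
|     }
|
|     // Wrap a single model call in a span with the obvious
|     // attributes. This captures prompt, settings and output, but
|     // says nothing about the agent/RAG flow that produced them.
|     async function tracedCompletion(prompt: string, model: string) {
|       return tracer.startActiveSpan("llm.completion", async (span) => {
|         span.setAttribute("llm.model", model);
|         span.setAttribute("llm.prompt", prompt);
|         const output = await callModel(model, prompt);
|         span.setAttribute("llm.output", output);
|         span.end();
|         return output;
|       });
|     }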
|
| Like, how many instrumentations save the prompt, IO and model
| settings without the orchestration code or agent/RAG flow? How
| does this help any production-level LLM use case?
|
| What is this application where I am just using bare LLM prompting
| and RAG without any custom logic, but I need a tracing solution
| and a collaborative prompt playground? I have yet to see it.
|
| Unless we can trace and version everything that actually
| influences the final LLM call, there is no use in a standardized
| framework and we need to roll a bespoke solution for every case.
| We try often and it always comes down to this.
|
| Build something that allows me to trace, evaluate and track
| everything, allow for deployment in customer tenants and on prem,
| and you have it.
|
| Stop spending your time on prompt UIs and playgrounds. We code.
| Our LLM apps are code, lots of it! Make the foundation of your
| framework solid first, then worry about turning temperature knobs
| in a user interface.
| dsaffy wrote:
| Yeah we totally agree. That's why we work on the end-to-end
| app, not on a single prompt. You pick what parameters become
| knobs in the frontend. So if you have a giant app with 10
| parameters (say 5 prompts, 5 numbers), great, wrap those and
| they become knobs on our frontend. We override during the
| actual end-to-end testing execution, kinda like Statsig /
| LaunchDarkly (only with typing).
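|
| A hypothetical sketch of the "only with typing" part (illustrative
| names, not the actual Gentrace SDK): overrides arrive from a UI as
| untyped JSON, so each knob keeps a default plus a runtime check
| before the end-to-end run uses it.
|
|     // Hypothetical: each knob is a default value plus a type
|     // guard, so UI-supplied overrides are validated before the
|     // end-to-end run - roughly a typed feature-flag lookup.
|     type Knob<T> = { value: T; check: (v: unknown) => v is T };
|
|     const knobs: { prompt: Knob<string>; topK: Knob<number> } = {
|       prompt: {
|         value: "Summarize the ticket.",
|         check: (v: unknown): v is string => typeof v === "string",
|       },
|       topK: {
|         value: 5,
|         check: (v: unknown): v is number => typeof v === "number",
|       },
|     };
|
|     // Merge untyped overrides (e.g. a parsed webhook body) into
|     // typed parameters, falling back to the default on a failed
|     // check. Usage: resolve(JSON.parse(requestBody))
|     function resolve(overrides: Record<string, unknown>) {
|       const p = overrides["prompt"];
|       const k = overrides["topK"];
|       return {
|         prompt: knobs.prompt.check(p) ? p : knobs.prompt.value,
|         topK: knobs.topK.check(k) ? k : knobs.topK.value,
|       };
|     }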
___________________________________________________________________
(page generated 2024-12-12 23:01 UTC)