[HN Gopher] DSPy - Programming-not prompting-LMs
       ___________________________________________________________________
        
       DSPy - Programming-not prompting-LMs
        
       Author : ulrischa
       Score  : 164 points
       Date   : 2024-12-06 19:59 UTC (1 days ago)
        
 (HTM) web link (dspy.ai)
 (TXT) w3m dump (dspy.ai)
        
       | byefruit wrote:
        | I've seen a couple of talks on DSPy and tried to use it for one
        | of my projects, but the structure always feels somewhat
        | strained. It seems suited to tasks that are primarily "show,
        | don't tell", but what do you do when you have significant prior
        | instruction you want to tell?
       | 
        | E.g. tests I want applied to anything retrieved from the
        | database. What I'd like is to optimise the prompt around those
        | (or maybe even the tests themselves), but I can't seem to
        | express that in DSPy signatures.
        
         | thatsadude wrote:
          | You can optimize the prompt with MIPROv2 without examples
          | (set the max number of examples to 0).
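          | 
          | A minimal sketch of that (untested; `my_metric` and `trainset`
          | are placeholders, and the exact kwarg names may vary by DSPy
          | version):
          | 
          |     import dspy
          | 
          |     # my_metric and trainset are placeholders you define elsewhere
          |     program = dspy.Predict("question -> answer")
          |     optimizer = dspy.MIPROv2(metric=my_metric, auto="light")
          |     # zero-shot: optimize the instructions only, no few-shot demos
          |     optimized = optimizer.compile(
          |         program,
          |         trainset=trainset,
          |         max_bootstrapped_demos=0,
          |         max_labeled_demos=0,
          |     )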
        
       | thom wrote:
       | There are companies that will charge you 6-7 figures for what you
       | can do in a few dozen lines with DSPy, but I guess that's true of
       | many things.
        
         | 3abiton wrote:
         | That is a big claim though, I am not sure about that from my
         | experience. What am I missing?
        
       | behnamoh wrote:
       | Every time I've seen a dspy article, I end up thinking: ok, but
       | what does it do exactly?
       | 
        | I've been using guidance, outlines, GBNF grammars, etc. What
        | advantage does dspy have over those alternatives?
       | 
       | I've learnt that the best package to use LLMs is just Python.
       | These "LLM packages" just make it harder to do customizations as
       | they all make opinionated assumptions and decisions.
        
         | dr_kiszonka wrote:
         | Question from a casual AI user, if you have a minute. It seems
         | to me that I could get much more productive by making my own
         | personal AI "system". For example, write a simple pipeline
         | where Claude would scrutinize OpenAI's answers and vice versa.
         | 
         | Are there any beginner-friendly Python packages that you would
         | recommend to facilitate fast experimentation with such ideas?
        
           | gradys wrote:
           | I assume you know how to program in Python? I would start
           | with just the client libraries of the model providers you
           | want to use. LLMs are conceptually simple when treated as
           | black boxes. String in, string out. You don't necessarily
           | need a framework.
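            | 
            | For example, with just the OpenAI client (a quick sketch;
            | the model name is only a placeholder):
            | 
            |     from openai import OpenAI
            | 
            |     client = OpenAI()  # reads OPENAI_API_KEY from the environment
            |     reply = client.chat.completions.create(
            |         model="gpt-4o-mini",  # placeholder model name
            |         messages=[{"role": "user", "content": "Say hi in one word."}],
            |     )
            |     print(reply.choices[0].message.content)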
        
           | monkmartinez wrote:
            | Not the person you asked, but I will take a shot at your
            | question. The best Python "package" for getting what you
            | want out of these systems is original-gangster Python, plus
            | libraries to help with your goals.
           | 
            | For your example: write a Python script with requests that
            | hits the OpenAI API. You can even hardcode the API key,
            | because it's just a script on your computer! Now you have
            | the GPT-4proLight-mini-deluxe response in JSON. You can pipe
            | that into a bazillion and one different places, including
            | another API request to Anthropic. Once that returns, you
            | have TWO LLM responses to analyze.
           | 
            | I tried haystack, langchain, txtai, langroid, CrewAI,
            | Autogen, and more that I am forgetting. One day while I was
            | reading r/LocalLLaMA, someone wrote: "All these packages are
            | TRASH, just write Python!"... Lightbulb moment for me. Duh!
            | Now I don't need to learn a massive framework only to use
            | 1/363802983th of it while cursing that I can't figure out
            | how to make it do what I want.
           | 
            | Just write Python. I tell you, that has been massive for my
            | usage of these LLMs outside of chat interfaces like
            | LibreChat and OpenWebUI. You can even have Claude or
            | DeepSeek write the script for you. That often gets me within
            | striking distance of what I really want to achieve.
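            | 
            | A rough sketch of that pipeline (untested; the model names
            | are just placeholders):
            | 
            |     import os, requests
            | 
            |     q = "Explain DSPy in two sentences."
            | 
            |     # 1. Ask OpenAI (model name is a placeholder)
            |     r1 = requests.post(
            |         "https://api.openai.com/v1/chat/completions",
            |         headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
            |         json={"model": "gpt-4o-mini",
            |               "messages": [{"role": "user", "content": q}]},
            |     ).json()
            |     answer = r1["choices"][0]["message"]["content"]
            | 
            |     # 2. Pipe that answer into Anthropic for a second opinion
            |     r2 = requests.post(
            |         "https://api.anthropic.com/v1/messages",
            |         headers={"x-api-key": os.environ["ANTHROPIC_API_KEY"],
            |                  "anthropic-version": "2023-06-01"},
            |         json={"model": "claude-3-5-sonnet-latest",
            |               "max_tokens": 1024,
            |               "messages": [{"role": "user", "content":
            |                             f"Critique this answer to '{q}':\n{answer}"}]},
            |     ).json()
            |     print(r2["content"][0]["text"])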
        
           | digdugdirk wrote:
            | I've had good luck with a light "shim layer" library that
            | handles the actual interfacing with the API and implements
            | the plumbing for any fun new features that get introduced.
           | 
           | I've settled on the Mirascope library
           | (https://mirascope.com/), which suits my use cases and lets
           | me implement structured inputs/outputs via pydantic models,
           | which is nice. I really like using it, and the team behind it
           | is really responsive and helpful.
           | 
           | That being said, Pydantic just released an AI library of
           | their own (https://ai.pydantic.dev/) that I haven't checked
           | out, but I'd love to hear from someone who has! Given their
           | track record, it's certainly worth keeping an eye on.
        
           | pizza wrote:
            | Yes, Anthropic just released the Model Context Protocol, and
            | MCP is perfect for this kind of thing. I actually wrote an
            | MCP server for Claude to call out to OpenAI just yesterday.
        
           | kingkongjaffa wrote:
            | You could use the plain API libraries for each LLM plus
            | IPython notebooks. Conceptually, each block could be a node
            | or link in the prompt chain, and the input/output of each
            | block is printable and visible, so you can check which part
            | of the chain is failing or producing sub-optimal outputs.
        
           | sdesol wrote:
           | > For example, write a simple pipeline where Claude would
           | scrutinize OpenAI's answers and vice versa.
           | 
            | I'm working on a naive approach to identifying errors in LLM
            | responses, which I talk about at
            | https://news.ycombinator.com/item?id=42313401#42313990 and
            | which can be used to scrutinize responses. It's written in
            | JavaScript, but you will be able to create a new chat by
            | calling an HTTP endpoint.
           | 
           | I'm hoping to have the system in place in a couple of weeks.
        
           | jackmpcollins wrote:
           | I'm building magentic for use cases like this!
           | 
           | https://github.com/jackmpcollins/magentic
           | 
           | It's based on pydantic and aims to make writing LLM queries
           | as easy/compact as possible by using type annotations,
           | including for structured outputs and streaming. If you use it
           | please reach out!
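            | 
            | The flavor is roughly this (a minimal sketch; see the README
            | for real, tested examples):
            | 
            |     from magentic import prompt
            |     from pydantic import BaseModel
            | 
            |     # Verdict is just an example schema for this sketch
            |     class Verdict(BaseModel):
            |         correct: bool
            |         critique: str
            | 
            |     @prompt("Review this answer and judge whether it is correct: {answer}")
            |     def review(answer: str) -> Verdict: ...
            | 
            |     verdict = review("The capital of France is Lyon.")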
        
           | d4rkp4ttern wrote:
           | You can have a look at Langroid -- it's an agent-oriented LLM
           | programming framework from CMU/UW-Madison researchers. We
           | started building it in Apr 2023 out of frustration with the
           | bloat of then-existing libs.
           | 
           | In langroid you set up a ChatAgent class which encapsulates
           | an LLM-interface plus any state you'd like. There's a Task
           | class that wraps an Agent and allows inter-agent
           | communication and tool-handling. We have devs who've found
           | our framework easy to understand and extend for their
           | purposes, and some companies are using it in production (some
           | have endorsed us publicly). A quick tour gives a flavor of
           | Langroid:
           | https://langroid.github.io/langroid/tutorials/langroid-tour/
           | 
           | Feel free to drop into our discord for help.
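            | 
            | The bare-bones shape looks roughly like this (a simplified
            | sketch; config kwargs omitted or abbreviated, so check the
            | docs for the real thing):
            | 
            |     import langroid as lr
            | 
            |     # ChatAgent = LLM interface + state; Task wraps the agent.
            |     # Defaults and kwargs simplified; see the Langroid docs.
            |     agent = lr.ChatAgent(lr.ChatAgentConfig())
            |     task = lr.Task(agent, interactive=False)
            |     task.run("Summarize what DSPy does in one sentence.")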
        
       | scosman wrote:
        | Can someone explain what DSPy does that fine-tuning doesn't?
        | Structured IO, optimized for better results. Sure. But why not
        | just go straight to the weights, instead of trying to optimize
        | the few-shot space?
        
         | flakiness wrote:
          | It has multiple optimization strategies. One is optimizing the
          | few-shot list. Another is to let the model write prompts and
          | pick the best one based on the given eval. I find the latter
          | much more intriguing, although I have no idea how practical it
          | is.
        
         | choppaface wrote:
         | The main idea behind DSPy is that you can't modify the weights,
         | but you can perhaps modify the prompts. DSPy's original primary
         | customer was multi-llm-agent systems where you have a chain /
         | graph of LLM calls (perhaps mostly or all to OpenAI GPT) and
         | you have some metric (perhaps vague) that you want to increase.
          | While the idea may seem a bit weird, there have been various
          | success stories, such as a UoT team winning a medical-notes-
          | oriented competition using DSPy:
          | https://arxiv.org/html/2404.14544v1
        
       | deepsquirrelnet wrote:
       | I use DSPy often, and it's the only framework that I have much
       | interest in using professionally.
       | 
       | Evaluations are first class and have a natural place in
       | optimization. I still usually spend some time adjusting initial
       | prompts, but more time doing traditional ML things... like
       | working with SMEs, building training sets, evaluating models and
       | developing the pipeline. If you're an ML engineer that's
       | frustrated by the "loose" nature of developing applications with
       | LLMs, I recommend trying it out.
       | 
        | With assertions and suggestions, there are also additional
        | pathways you can use to enforce constraints on the output and
        | build in requirements from your customer.
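        | 
        | For example, a suggestion inside a module's forward (a rough
        | sketch; per the docs you also have to activate assertions on the
        | program for these to take effect):
        | 
        |     import dspy
        | 
        |     # example module, not from the DSPy docs verbatim
        |     class Summarize(dspy.Module):
        |         def __init__(self):
        |             super().__init__()
        |             self.summarize = dspy.ChainOfThought("document -> summary")
        | 
        |         def forward(self, document):
        |             pred = self.summarize(document=document)
        |             # soft constraint: on violation, retry with this feedback
        |             dspy.Suggest(len(pred.summary.split()) <= 50,
        |                          "Keep the summary under 50 words.")
        |             return pred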
        
         | qeternity wrote:
         | What do you actually use it for? I've never been able to
         | actually get it to perform on anything remotely close to what
         | it claims. Sure, it can help optimize few shot prompting...but
         | what else can it reliably do?
        
           | deepsquirrelnet wrote:
           | It isn't for every application, but I've used it for tasks
           | like extraction, summarization and generating commands where
           | you have specific constraints you're trying to meet.
           | 
            | Most important to me is that I can write evaluations based
            | on feedback from the team, build them into the pipeline
            | using suggestions, and track them with LLM-as-a-judge (and
            | other) metrics. With some of the optimizers, you can use
            | stronger models to help propose and test new instructions
            | for your student model to follow, as well as optimize the
            | N-shot examples to use in the prompt (the MIPROv2
            | optimizer).
           | 
            | It's not that a lot of that can't be done in other ways, but
            | as a framework it provides a non-trivial amount of value to
            | me when I'm trying to keep track of requirements that grow
            | over time, instead of playing whack-a-mole in the prompt.
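            | 
            | A rough sketch of the LLM-as-a-judge metric piece (the field
            | names here are made up):
            | 
            |     import dspy
            | 
            |     # field names are illustrative only
            |     judge = dspy.Predict("command, requirements -> meets_requirements: bool")
            | 
            |     def metric(example, pred, trace=None):
            |         verdict = judge(command=pred.command,
            |                         requirements=example.requirements)
            |         return bool(verdict.meets_requirements)
            | 
            |     # hand `metric` to an optimizer such as MIPROv2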
        
         | huevosabio wrote:
         | Every time I check the docs, I feel like it obfuscates so many
         | things that it puts me off and I decide to just not try it out.
         | 
          | Behind the scenes it's using LLMs to find the proper
          | prompting. I find that it uses terminology and abstractions
          | that are way too complicated for what it is.
        
       | beepbooptheory wrote:
        | How does it work? Like, I can see the goal and the results, but
        | is it in fact the case that it's _still_ "LLMs all the way
        | down" here? That is, is there a supplemental bot that's fine-
        | tuned to DSPy syntax, doing the actual work of turning the code
        | into a prompt? I'm trying to figure out how else it would
        | work... But if that is the case, this really feels like a
        | Wizard of Oz, man-behind-the-curtain thing.
        
       | aaronvg wrote:
       | I found it interesting how DSPy created the Signatures concept:
       | https://dspy.ai/learn/programming/signatures/
       | 
       | We took this kind of concept all the way to making a DSL called
       | BAML, where prompts look like literal functions, with input and
       | output types.
       | 
       | Playground link here https://www.promptfiddle.com/
       | 
       | https://github.com/BoundaryML/baml
       | 
       | (tried pasting code but the formatting is completely off here,
       | sorry).
       | 
       | We think we could run some optimizers on this as well in the
       | future! We'll definitely use DSPy as inspiration!
        
         | dcreater wrote:
          | Good DX? BAML looks even worse than the current API-call-based
          | paradigm.
          | 
          | Even your toy examples look bad - I wouldn't want to see what
          | an actual program would look like.
          | 
          | Hopefully this, dspy, and the like, with their poor, inelegant
          | design, won't become common standards.
        
           | aaronvg wrote:
           | How do you organize your prompts? Do you use a templating
           | language like jinja? How complex are your prompts? Do you
           | have any open source examples?
           | 
            | I'm genuinely curious, since if we can convince someone like
            | you that BAML is amazing, we're on a good track.
            | 
            | We've helped people remove really ugly concatenated strings
            | or raw YAML files with JSON schemas just by using our prompt
            | format (which uses Jinja2!).
        
       | coffeephoenix wrote:
        | I tried using it, but one of the hard parts is defining a good
        | metric that the underlying optimizer can use. I came up with an
        | approach for that here:
       | 
       | https://colab.research.google.com/drive/1obuS9cEWN9MT-MIv5aL...
        
       | jmugan wrote:
        | As far as I can tell, the Google Vertex AI prompt optimizer does
        | similar things. I find their documentation more comprehensible.
       | https://cloud.google.com/vertex-ai/generative-ai/docs/learn/...
        
       | Imanari wrote:
       | Can someone explain how it works?
        
       | thomasahle wrote:
        | Here is a simple example of dspy:
        | 
        |     classify = dspy.Predict(f"text -> label:Literal{CLASSES}")
        |     optimized = dspy.BootstrapFewShot(
        |         metric=lambda x, y: x.label == y.label
        |     ).compile(classify, trainset=load_dataset('Banking77'))
        |     label = optimized(text="What does a pending cash withdrawal mean?").label
       | 
       | What this does is optimize a prompt given a dataset (here
       | Banking77).
       | 
       | The optimizer, BootstrapFewShot, simply selects a bunch of random
       | subsets from the training set, and measures which gives the best
       | performance on the rest of the dataset when used as few-shot
       | examples.
       | 
       | There are also more fancy optimizers, including ones that first
       | optimize the prompt, and then use the improved model as a teacher
       | to optimize the weights. This has the advantage that you don't
       | need to pay for a super long prompt on every inference call.
       | 
       | dspy has more cool features, such as the ability to train a large
       | composite LLM program "end to end", similar to backprop.
       | 
       | The main advantage, imo, is just not having "stale" prompts
       | everywhere in your code base. You might have written some neat
       | few-shot examples for the middle layers of your pipeline, but
       | then you change something at the start, and you have to manually
       | rewrite all the examples for every other module. With dspy you
       | just keep your training datasets around, and the rest is
       | automated.
       | 
       | (Note, the example above is taken from the new website:
       | https://dspy.ai/#__tabbed_3_3 and simplified a bit)
        
       | thatsadude wrote:
        | My go-to framework. I wish we could use global metrics in DSPy,
        | for example an F1 score over the whole evaluation set (instead
        | of per single query, as it is at the moment). The recent async
        | support has been a life saver.
        
       | dcreater wrote:
        | DSPy seems unnecessarily convoluted and inelegant, or am I just
        | stupid?
        
         | th0ma5 wrote:
          | I think you read it right. It is, in my mind, a kind of
          | wishcasting that adding other modeling on top of LLMs can
          | improve their use, but the ideas all sound like playing with
          | your food at best, and like deliberately confusing people to
          | prey on their excitement at worst.
        
           | edmundsauto wrote:
           | I'm torn - I like the promise and people are getting value
           | out of it. I need to try it myself on a toy project!
           | 
           | What experiences/evidence do you have that informed your
           | opinion? It sounds like you've had pretty negative
           | experiences.
        
       ___________________________________________________________________
       (page generated 2024-12-07 23:01 UTC)