[HN Gopher] DSPy - Programming-not prompting-LMs
___________________________________________________________________
DSPy - Programming-not prompting-LMs
Author : ulrischa
Score : 164 points
Date : 2024-12-06 19:59 UTC (1 day ago)
(HTM) web link (dspy.ai)
(TXT) w3m dump (dspy.ai)
| byefruit wrote:
| I've seen a couple of talks on DSPy and tried to use it for one
| of my projects, but the structure always feels somewhat strained.
| It seems suited to tasks that are primarily "show, don't tell" --
| but what do you do when you have significant prior instruction
| you want to tell?
|
| E.g. tests I want applied to anything retrieved from the
| database. What I'd like is to optimise the prompt around those
| (or maybe even the tests themselves), but I can't seem to express
| that in DSPy signatures.
| thatsadude wrote:
| You can optimize the prompt with MIPROv2 without any examples
| (set the max number of examples to 0).
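|
| Roughly like this (a sketch -- my_metric, program and trainset
| are placeholders, and exact parameter names vary a bit across
| dspy versions):
|
|     import dspy
|
|     tp = dspy.MIPROv2(metric=my_metric, auto="light")
|     optimized = tp.compile(
|         program,
|         trainset=trainset,
|         max_bootstrapped_demos=0,  # no bootstrapped demos
|         max_labeled_demos=0,       # instructions-only
|     )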
| thom wrote:
| There are companies that will charge you 6-7 figures for what you
| can do in a few dozen lines with DSPy, but I guess that's true of
| many things.
| 3abiton wrote:
| That's a big claim, though; it doesn't match my experience. What
| am I missing?
| behnamoh wrote:
| Every time I've seen a dspy article, I end up thinking: ok, but
| what does it do exactly?
|
| I've been using guidance, outlines, GBNF grammars, etc. What
| advantage does dspy have over those alternatives?
|
| I've learnt that the best package for using LLMs is just Python.
| These "LLM packages" just make customization harder, since they
| all make opinionated assumptions and decisions.
| dr_kiszonka wrote:
| Question from a casual AI user, if you have a minute. It seems
| to me that I could become much more productive by building my
| own personal AI "system" -- for example, a simple pipeline where
| Claude scrutinizes OpenAI's answers and vice versa.
|
| Are there any beginner-friendly Python packages that you would
| recommend to facilitate fast experimentation with such ideas?
| gradys wrote:
| I assume you know how to program in Python? I would start
| with just the client libraries of the model providers you
| want to use. LLMs are conceptually simple when treated as
| black boxes. String in, string out. You don't necessarily
| need a framework.
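|
| For instance, a full round trip with the official openai client
| is only a few lines (model name illustrative):
|
|     from openai import OpenAI
|
|     client = OpenAI()  # reads OPENAI_API_KEY from the env
|     reply = client.chat.completions.create(
|         model="gpt-4o-mini",
|         messages=[{"role": "user", "content": "Hello"}],
|     )
|     print(reply.choices[0].message.content)  # string out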
| monkmartinez wrote:
| Not the person you asked, but I'll take a shot at your question.
| The best Python package for getting what you want out of these
| systems is original-gangster Python, with libraries to help you
| toward your goals.
|
| For your example: write a Python script with requests that hits
| the OpenAI API. You can even hardcode the API key, because it's
| just a script on your computer! Now you have the
| GPT-4proLight-mini-deluxe response in JSON. You can pipe that
| into a bazillion and one different places, including another API
| request to Anthropic. Once that returns, you have TWO LLM
| responses to analyze.
|
| I tried haystack, langchain, txtai, langroid, CrewAI, Autogen,
| and more that I'm forgetting. One day while I was reading
| r/LocalLLaMA, someone wrote: "All these packages are TRASH, just
| write Python!"... Lightbulb moment for me. Duh! Now I don't need
| to learn a massive framework only to use 1/363802983th of it
| while cursing that I can't figure out how to make it do what I
| want it to do.
|
| Just write Python. I tell you, that has been massive for my use
| of these LLMs outside of chat interfaces like LibreChat and
| OpenWebUI. You can even have Claude or DeepSeek write the script
| for you. That often gets me within striking distance of what I
| really want to achieve at that moment.
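|
| A minimal sketch of that pipeline, using nothing but requests
| (model names illustrative; keys via env vars, or hardcode them
| -- it's your computer):
|
|     import os, requests
|
|     q = "Explain the CAP theorem in two sentences."
|
|     # Ask OpenAI first
|     gpt = requests.post(
|         "https://api.openai.com/v1/chat/completions",
|         headers={"Authorization":
|                  f"Bearer {os.environ['OPENAI_API_KEY']}"},
|         json={"model": "gpt-4o-mini",
|               "messages": [{"role": "user", "content": q}]},
|     ).json()
|     answer = gpt["choices"][0]["message"]["content"]
|
|     # Pipe the JSON response into Anthropic for a second opinion
|     claude = requests.post(
|         "https://api.anthropic.com/v1/messages",
|         headers={"x-api-key": os.environ["ANTHROPIC_API_KEY"],
|                  "anthropic-version": "2023-06-01"},
|         json={"model": "claude-3-5-sonnet-latest",
|               "max_tokens": 1024,
|               "messages": [{"role": "user", "content":
|                   f"Critique this answer to '{q}':\n{answer}"}]},
|     ).json()
|     print(claude["content"][0]["text"])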
| digdugdirk wrote:
| I've had good luck with a light "shim layer" library that
| handles the actual interfacing with the API and implements the
| plumbing for any fun new features that get introduced.
|
| I've settled on the Mirascope library
| (https://mirascope.com/), which suits my use cases and lets
| me implement structured inputs/outputs via pydantic models,
| which is nice. I really like using it, and the team behind it
| is really responsive and helpful.
|
| That being said, Pydantic just released an AI library of
| their own (https://ai.pydantic.dev/) that I haven't checked
| out, but I'd love to hear from someone who has! Given their
| track record, it's certainly worth keeping an eye on.
| pizza wrote:
| Yes, Anthropic just released the Model Context Protocol, and MCP
| is perfect for this kind of thing. I actually wrote an MCP
| server for Claude to call out to OpenAI just yesterday.
| kingkongjaffa wrote:
| You could use the plain API libraries for each LLM plus IPython
| notebooks. Conceptually, each cell is a node or link in the
| prompt chain, and the input/output of each cell is printable and
| visible, so you can check which part of the chain is failing or
| producing suboptimal outputs.
| sdesol wrote:
| > For example, write a simple pipeline where Claude would
| scrutinize OpenAI's answers and vice versa.
|
| I'm working on a naive approach to identifying errors in LLM
| responses, which I talk about at
| https://news.ycombinator.com/item?id=42313401#42313990 and which
| can be used to scrutinize responses. It's written in JavaScript,
| though you will be able to create a new chat by calling an HTTP
| endpoint.
|
| I'm hoping to have the system in place in a couple of weeks.
| jackmpcollins wrote:
| I'm building magentic for use cases like this!
|
| https://github.com/jackmpcollins/magentic
|
| It's based on Pydantic and aims to make writing LLM queries as
| easy/compact as possible by using type annotations, including
| for structured outputs and streaming. If you use it, please
| reach out!
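|
| The basic shape (condensed from the README):
|
|     from magentic import prompt
|     from pydantic import BaseModel
|
|     class Superhero(BaseModel):
|         name: str
|         powers: list[str]
|
|     @prompt("Create a superhero named {name}.")
|     def create_superhero(name: str) -> Superhero: ...
|
|     hero = create_superhero("Garden Man")  # a Superhero instance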
| d4rkp4ttern wrote:
| You can have a look at Langroid -- it's an agent-oriented LLM
| programming framework from CMU/UW-Madison researchers. We
| started building it in Apr 2023 out of frustration with the
| bloat of then-existing libs.
|
| In Langroid you set up a ChatAgent class, which encapsulates an
| LLM interface plus any state you'd like. There's a Task class
| that wraps an Agent and enables inter-agent communication and
| tool handling. We have devs who've found
| our framework easy to understand and extend for their
| purposes, and some companies are using it in production (some
| have endorsed us publicly). A quick tour gives a flavor of
| Langroid:
| https://langroid.github.io/langroid/tutorials/langroid-tour/
|
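| The basic shape looks like this (simplified sketch; see the
| tour for real configs):
|
|     import langroid as lr
|     import langroid.language_models as lm
|
|     llm_config = lm.OpenAIGPTConfig()  # defaults to OpenAI
|     agent = lr.ChatAgent(lr.ChatAgentConfig(llm=llm_config))
|     task = lr.Task(agent, name="Assistant", interactive=False)
|     result = task.run("What is the capital of France?")
|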
| Feel free to drop into our discord for help.
| scosman wrote:
| Can someone explain what DSPy does that fine-tuning doesn't?
| Structured IO, optimized for better results -- sure. But why not
| just go straight to the weights, instead of trying to optimize
| the few-shot space?
| flakiness wrote:
| It has multiple optimization strategies. One optimizes the
| few-shot list. Another lets the model write prompts and picks
| the best one based on the given eval. I find the latter much
| more intriguing, although I have no idea how practical it is.
| choppaface wrote:
| The main idea behind DSPy is that you can't modify the weights,
| but you can perhaps modify the prompts. DSPy's original primary
| customer was multi-LLM-agent systems, where you have a chain /
| graph of LLM calls (perhaps mostly or all to OpenAI GPT) and
| some metric (perhaps vague) that you want to increase. While the
| idea may seem a bit weird, there have been various success
| stories, such as a UoT team winning a medical-notes-oriented
| competition using DSPy: https://arxiv.org/html/2404.14544v1
| deepsquirrelnet wrote:
| I use DSPy often, and it's the only framework that I have much
| interest in using professionally.
|
| Evaluations are first class and have a natural place in
| optimization. I still usually spend some time adjusting initial
| prompts, but more time doing traditional ML things... like
| working with SMEs, building training sets, evaluating models and
| developing the pipeline. If you're an ML engineer who's
| frustrated by the "loose" nature of developing applications with
| LLMs, I recommend trying it out.
|
| With assertions and suggestions, there are also additional
| pathways you can use to enforce constraints on the output and
| build in requirements from your customer.
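|
| A suggestion is just an inline check (rough sketch; the module
| needs assertions activated for it to take effect):
|
|     import dspy
|
|     class Summarizer(dspy.Module):
|         def __init__(self):
|             super().__init__()
|             self.summarize = dspy.ChainOfThought(
|                 "document -> summary")
|
|         def forward(self, document):
|             pred = self.summarize(document=document)
|             # A customer requirement as a soft constraint: on
|             # failure, dspy backtracks and retries with this
|             # feedback added to the prompt.
|             dspy.Suggest(
|                 len(pred.summary.split()) <= 100,
|                 "Keep the summary under 100 words.",
|             )
|             return pred
|
|     program = Summarizer().activate_assertions()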
| qeternity wrote:
| What do you actually use it for? I've never been able to get it
| to perform at anything remotely close to what it claims. Sure,
| it can help optimize few-shot prompting... but what else can it
| reliably do?
| deepsquirrelnet wrote:
| It isn't for every application, but I've used it for tasks
| like extraction, summarization and generating commands where
| you have specific constraints you're trying to meet.
|
| Most important to me is that I can write evaluations based on
| feedback from the team and build them into the pipeline using
| suggestions and track them with LLM as a judge (and other)
| metrics. With some of the optimizers, you can use stronger
| models to help propose and test new instructions for your
| student model to follow, as well as optimize the N shot
| examples to use in the prompt (MIPROv2 optimizer).
|
| It's not that a lot of this can't be done in other ways, but as
| a framework it provides a non-trivial amount of value to me when
| I'm trying to keep track of requirements that grow over time,
| instead of playing whack-a-mole in the prompt.
| huevosabio wrote:
| Every time I check the docs, I feel like it obfuscates so many
| things that it puts me off and I decide to just not try it out.
|
| Behind the scenes it's using LLMs to find the proper prompting.
| I find that it uses terminology and abstractions that are way
| too complicated for what it is.
| beepbooptheory wrote:
| How does it work? I can see the goal and the results, but is it
| in fact _still_ "LLMs all the way down"? That is, is there a
| supplementary bot here that's fine-tuned to DSPy syntax, doing
| the actual work of turning the code into prompts? I'm trying to
| figure out how else it would work... But if that is the case,
| this really feels like a Wizard-of-Oz, man-behind-the-curtain
| thing.
| aaronvg wrote:
| I found it interesting how DSPy created the Signatures concept:
| https://dspy.ai/learn/programming/signatures/
|
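| E.g. a class-based signature is essentially a typed function
| declaration (a sketch based on that docs page):
|
|     import dspy
|
|     class ExtractInfo(dspy.Signature):
|         """Extract structured info from the text."""
|         text: str = dspy.InputField()
|         title: str = dspy.OutputField()
|         entities: list[str] = dspy.OutputField(
|             desc="named entities")
|
|     extract = dspy.Predict(ExtractInfo)
|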
| We took this kind of concept all the way to making a DSL called
| BAML, where prompts look like literal functions, with input and
| output types.
|
| Playground link here https://www.promptfiddle.com/
|
| https://github.com/BoundaryML/baml
|
| (tried pasting code but the formatting is completely off here,
| sorry).
|
| We think we could run some optimizers on this as well in the
| future! We'll definitely use DSPy as inspiration!
| dcreater wrote:
| Good DX? BAML looks even worse than the current API-call-based
| paradigm.
|
| Even your toy examples look bad; I wouldn't want to see what an
| actual program would look like.
|
| Hopefully this, dspy, and similarly poorly designed, inelegant
| frameworks won't become common standards.
| aaronvg wrote:
| How do you organize your prompts? Do you use a templating
| language like jinja? How complex are your prompts? Do you
| have any open source examples?
|
| I'm genuinely curious, since if we can convince someone like you
| that BAML is amazing, we're on a good track.
|
| We've helped people get rid of really ugly concatenated strings
| and raw YAML files full of JSON schemas just by switching to our
| prompt format (which uses Jinja2!).
| coffeephoenix wrote:
| I tried using it, but one of the hard parts is defining a good
| metric that the underlying optimizer can use. I came up with an
| approach for that here:
|
| https://colab.research.google.com/drive/1obuS9cEWN9MT-MIv5aL...
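|
| For context, a dspy metric is just a plain function over an
| example and a prediction, e.g.:
|
|     # Smallest possible dspy metric: exact label match. The
|     # optional `trace` argument is supplied during optimization.
|     def exact_match(example, pred, trace=None):
|         return example.label == pred.label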
| jmugan wrote:
| As far as I can tell, Google Vertex AI's prompt optimizer does
| similar things, and I find their documentation more
| comprehensible.
| https://cloud.google.com/vertex-ai/generative-ai/docs/learn/...
| Imanari wrote:
| Can someone explain how it works?
| thomasahle wrote:
| Here is a simple example of dspy:
|
|     classify = dspy.Predict(f"text -> label:Literal{CLASSES}")
|
|     optimized = dspy.BootstrapFewShot(
|         metric=(lambda x, y: x.label == y.label)
|     ).compile(classify, trainset=load_dataset('Banking77'))
|
|     label = optimized(
|         text="What does a pending cash withdrawal mean?"
|     ).label
|
| What this does is optimize a prompt given a dataset (here
| Banking77).
|
| The optimizer, BootstrapFewShot, simply selects a bunch of random
| subsets from the training set, and measures which gives the best
| performance on the rest of the dataset when used as few-shot
| examples.
|
| There are also fancier optimizers, including ones that first
| optimize the prompt and then use the improved model as a teacher
| to optimize the weights. This has the advantage that you don't
| need to pay for a super long prompt on every inference call.
|
| dspy has more cool features, such as the ability to train a large
| composite LLM program "end to end", similar to backprop.
|
| The main advantage, imo, is just not having "stale" prompts
| everywhere in your code base. You might have written some neat
| few-shot examples for the middle layers of your pipeline, but
| then you change something at the start, and you have to manually
| rewrite all the examples for every other module. With dspy you
| just keep your training datasets around, and the rest is
| automated.
|
| (Note, the example above is taken from the new website:
| https://dspy.ai/#__tabbed_3_3 and simplified a bit)
| thatsadude wrote:
| My go-to framework. I wish we could use global metrics in DSPy,
| for example an F1 score over the whole evaluation set (instead
| of per-query metrics at the moment). The recent async support
| has been a lifesaver.
| dcreater wrote:
| DSPy seems unnecessarily convoluted and inelegant -- or am I
| just stupid?
| th0ma5 wrote:
| I think you read it right. To my mind it's a kind of
| wishcasting that adding other modeling on top of LLMs can
| improve their use, but the ideas all sound like playing with
| your food at best, and like deliberately confusing people to
| prey on their excitement at worst.
| edmundsauto wrote:
| I'm torn -- I like the promise, and people are getting value
| out of it. I need to try it myself on a toy project!
|
| What experiences/evidence do you have that informed your
| opinion? It sounds like you've had pretty negative
| experiences.
___________________________________________________________________
(page generated 2024-12-07 23:01 UTC)