[HN Gopher] Beyond Quacking: Deep Integration of Language Models...
___________________________________________________________________
Beyond Quacking: Deep Integration of Language Models and RAG into
DuckDB
Author : PaulHoule
Score : 105 points
Date : 2025-04-07 21:39 UTC (1 days ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| bob1029 wrote:
| You could quickly wire up one of the LLM APIs as an application-
| defined function using SQLite if you wanted to play around with
| the idea of very slow and expensive queries:
|
| https://sqlite.org/appfunc.html
|
| https://learn.microsoft.com/en-us/dotnet/standard/data/sqlit...
|
| Maybe stick with the aggregate variety of function at first if
| you don't want any billing explosions. I'd probably begin with
| something like LLM_Summary() and LLM_Classify(). The summary
| could be an aggregate, and the classify could be a scalar. Being
| able to write a query like:
|
|     SELECT LLM_Summary(Comment)
|     FROM Users
|     WHERE datetime(Updated_At) >= datetime('now', '-1 day');
|
| is more expedient than wiring up the equivalent code pile each
| time. The aggregation method's internals could handle
| hierarchical summarization, chunking, etc. Or, throw an error
| back to the user so they are forced to devise a more rational
| query.
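|
| A minimal sketch of the idea above, using Python's standard
| sqlite3 module rather than the .NET bindings linked earlier. The
| model call is replaced by a deterministic stub (fake_llm), and
| CHUNK_SIZE, MAX_ROWS and the classification rule are hypothetical
| values chosen only for illustration:

```python
import sqlite3

CHUNK_SIZE = 2    # hypothetical: texts sent per summarization call
MAX_ROWS = 1000   # hypothetical guard against billing explosions


def fake_llm(prompt):
    """Stand-in for a real LLM API call (assumption: swap in a real client)."""
    return "summary(" + prompt[:40] + ")"


def llm_classify(text):
    """Scalar UDF: invoked once per row. A trivial rule replaces the model."""
    return "question" if text and text.rstrip().endswith("?") else "statement"


class LLMSummary:
    """Aggregate UDF: collects rows, then summarizes them hierarchically."""

    def __init__(self):
        self.texts = []

    def step(self, value):
        if value is None:
            return
        if len(self.texts) >= MAX_ROWS:
            # Force the user to devise a narrower query.
            raise ValueError("too many rows; narrow the query")
        self.texts.append(str(value))

    def finalize(self):
        if not self.texts:
            return None
        level = self.texts
        while True:
            # Summarize each chunk, then summarize the summaries.
            level = [fake_llm(" | ".join(level[i:i + CHUNK_SIZE]))
                     for i in range(0, len(level), CHUNK_SIZE)]
            if len(level) == 1:
                return level[0]


conn = sqlite3.connect(":memory:")
conn.create_function("LLM_Classify", 1, llm_classify)
conn.create_aggregate("LLM_Summary", 1, LLMSummary)
conn.execute("CREATE TABLE Users (Comment TEXT, Updated_At TEXT)")
conn.executemany(
    "INSERT INTO Users VALUES (?, datetime('now', ?))",
    [("Is DuckDB faster?", "-1 hour"),
     ("I like SQLite.", "-2 hours"),
     ("Old comment.", "-3 days")],
)

summary = conn.execute(
    "SELECT LLM_Summary(Comment) FROM Users "
    "WHERE datetime(Updated_At) >= datetime('now', '-1 day')"
).fetchone()[0]
labels = [row[0] for row in conn.execute(
    "SELECT LLM_Classify(Comment) FROM Users ORDER BY rowid")]
```

| With an aggregate, the number of model calls depends on the
| chunking rather than directly on the row count, which is one way
| to read the "stick with the aggregate variety" advice.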
| falcor84 wrote:
| I love that and would maybe even add a model price parameter on
| each such function call. Perhaps, e.g., a number in the range
| 1-10, with 1 being the cheapest available, and 10 being the
| best available (whatever the price), and then we'd have
| environment settings choose the actual models to use for each
| value. And perhaps have an overload fail-safe to switch all of
| the queries to cheaper models as a form of throttling.
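|
| One way this tier parameter might look, as a rough sketch. The
| model names, the LLM_TIER_<n> and LLM_MAX_TIER environment
| variables, and the two-argument LLM_Classify signature are all
| assumptions made up for this example, not part of any existing
| tool:

```python
import os
import sqlite3

# Hypothetical model names; a real deployment would configure its own.
DEFAULT_TIERS = {1: "cheap-model", 5: "mid-model", 10: "best-model"}


def resolve_model(tier, env=None):
    """Map a 1-10 price tier to a model name using environment settings."""
    env = os.environ if env is None else env
    tier = int(tier)
    if not 1 <= tier <= 10:
        raise ValueError("tier must be in 1..10")
    # Fail-safe throttle: LLM_MAX_TIER caps every call at a cheaper tier.
    cap = max(int(env.get("LLM_MAX_TIER", "10")), 1)
    tier = min(tier, cap)
    # Use the nearest configured tier at or below the requested one.
    for t in range(tier, 0, -1):
        name = env.get(f"LLM_TIER_{t}") or DEFAULT_TIERS.get(t)
        if name:
            return name
    raise LookupError("no model configured")


def llm_classify(text, tier):
    """Scalar UDF taking the text and a price tier; the model call is stubbed."""
    return f"[{resolve_model(tier)}] label for: {text}"


conn = sqlite3.connect(":memory:")
conn.create_function("LLM_Classify", 2, llm_classify)
row = conn.execute("SELECT LLM_Classify('hello', 3)").fetchone()
```

| Setting LLM_MAX_TIER=1 in the environment would then push every
| query to the cheapest model without touching the SQL.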
| simonw wrote:
| I tried that a couple of years ago with a CLI tool that uses
| Python functions called from SQLite - it worked with GPT-3.5:
| https://simonwillison.net/2023/Apr/29/enriching-data/
|
| Example usage:
|
|     openai-to-sqlite query database.db "
|       update messages set sentiment = chatgpt(
|         'Sentiment analysis for this message: ' || message ||
|         ' - ONLY return a lowercase string from: positive, negative, neutral, unknown'
|       )
|       where sentiment not in ('positive', 'negative', 'neutral', 'unknown')
|         or sentiment is null
|     "
|
| I haven't revisited the idea for fear of the amount it could
| cost if you ran it against a large database, but given the
| crashing prices of Gemini Flash, GPT-4o mini etc maybe it's
| worth another look!
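|
| The WHERE clause in that query is what keeps reruns cheap: only
| rows without a valid label reach the model. A small sketch of that
| behaviour with Python's sqlite3 module follows. The chatgpt()
| function here is a local stub that records its calls, not the
| one shipped with openai-to-sqlite, and the sample rows are made
| up:

```python
import sqlite3

calls = []  # records every prompt the stub receives


def chatgpt(prompt):
    """Stub standing in for a paid API call (assumption for illustration)."""
    calls.append(prompt)
    return "positive" if "great" in prompt else "neutral"


conn = sqlite3.connect(":memory:")
conn.create_function("chatgpt", 1, chatgpt)
conn.execute("CREATE TABLE messages (message TEXT, sentiment TEXT)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [("This is great", None), ("It rained", ""), ("Fine", "neutral")],
)

UPDATE_SQL = """
update messages set sentiment = chatgpt(
  'Sentiment analysis for this message: ' || message ||
  ' - ONLY return a lowercase string from: positive, negative, neutral, unknown'
)
where sentiment not in ('positive', 'negative', 'neutral', 'unknown')
   or sentiment is null
"""

conn.execute(UPDATE_SQL)
first_run_calls = len(calls)                     # only unlabeled rows are sent
conn.execute(UPDATE_SQL)
second_run_calls = len(calls) - first_run_calls  # nothing left to label

sentiments = [r[0] for r in conn.execute(
    "SELECT sentiment FROM messages ORDER BY rowid")]
```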
| arthurcolle wrote:
| if you switch model to GPT-4.5-preview you can spend a lot of
| money very quickly
| datadrivenangel wrote:
| The API call is the same price per token regardless of how
| you run it!
| Xmd5a wrote:
| Maybe use embeddings from a BERT instead? This in particular:
| https://www.sbert.net/docs/sentence_transformer/pretrained_m...
| jt_b wrote:
| Perfect use case for https://github.com/urchade/GLiNER
| datadrivenangel wrote:
| Is there any reason why you can't build a full agent in SQL?
| SQLite with a little sauce should be competitive with Langchain
| if you think a little bit.
| bob1029 wrote:
| I don't see why not. Recursive CTEs allow you to express
| pretty much anything you desire.
|
| You could also expose additional functions corresponding to
| the external tools that you would like the agent to have
| access to and pass these as arguments to additional UDFs.
|
| You could also lean into a data-driven design and express much of
| the configuration in tables and then use the enhanced SQL dialect
| to tie everything together at runtime. In SQLite, during UDF
| execution, arbitrary queries can be run. You could pull tool
| descriptions, parameter lists, enums, etc. from ordinary SQL
| tables without having to pass explicit args to the functions.
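|
| As a rough illustration of the recursive CTE point, the sketch
| below drives a tiny "agent loop" from SQL. agent_step is a
| hypothetical UDF; here it is a fixed state table standing in for
| a model deciding the next action, and the step < 10 bound is an
| arbitrary safety limit:

```python
import sqlite3

# Hypothetical transitions; a real agent_step would call a model or a tool.
TRANSITIONS = {"start": "plan", "plan": "act", "act": "done"}


def agent_step(state):
    """Return the next state for the loop (stubbed, deterministic)."""
    return TRANSITIONS.get(state, "done")


conn = sqlite3.connect(":memory:")
conn.create_function("agent_step", 1, agent_step)

rows = conn.execute("""
    WITH RECURSIVE agent_loop(step, state) AS (
        SELECT 0, 'start'
        UNION ALL
        SELECT step + 1, agent_step(state)
        FROM agent_loop
        WHERE state <> 'done' AND step < 10   -- bound the number of iterations
    )
    SELECT step, state FROM agent_loop ORDER BY step
""").fetchall()
```

| Each recursive row feeds the previous state back into the UDF, so
| the CTE plays the role of the usual while loop in agent code.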
| datadrivenangel wrote:
| Use triggers to execute trees of tasks and you have a
| working agentic system!
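|
| A small sketch of how triggers could walk a task tree. The tasks
| schema, the status values and the run_task stub are assumptions
| for illustration. In this version the trigger only unblocks child
| tasks, and a short Python loop does the actual execution:

```python
import sqlite3


def run_task(name):
    """Stub for whatever a task would really do (model call, tool, etc.)."""
    return f"ran {name}"


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tasks (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES tasks(id),
    name      TEXT NOT NULL,
    status    TEXT NOT NULL DEFAULT 'blocked',
    result    TEXT
);

-- When a task finishes, its direct children become runnable.
CREATE TRIGGER unblock_children
AFTER UPDATE OF status ON tasks
WHEN NEW.status = 'done'
BEGIN
    UPDATE tasks SET status = 'ready'
    WHERE parent_id = NEW.id AND status = 'blocked';
END;

INSERT INTO tasks (id, parent_id, name, status) VALUES
    (1, NULL, 'root', 'ready'),
    (2, 1,    'a',    'blocked'),
    (3, 1,    'b',    'blocked'),
    (4, 2,    'a1',   'blocked');
""")

order = []
while True:
    row = conn.execute(
        "SELECT id, name FROM tasks WHERE status = 'ready' "
        "ORDER BY id LIMIT 1").fetchone()
    if row is None:
        break
    task_id, name = row
    order.append(name)
    # Marking the task done fires the trigger, which readies its children.
    conn.execute("UPDATE tasks SET status = 'done', result = ? WHERE id = ?",
                 (run_task(name), task_id))

unfinished = conn.execute(
    "SELECT COUNT(*) FROM tasks WHERE status <> 'done'").fetchone()[0]
```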
| dpflan wrote:
| Any benefit to having a graph database component
| alongside the SQL? Maybe add some Apache Age to Postgres?
|
| """
| Apache AGE(tm) Graph Database for PostgreSQL
|
| Apache AGE(tm) is a PostgreSQL Graph database compatible with
| PostgreSQL's distributed assets and leverages graph data
| structures to analyze and use relationships and patterns
| in data.
| """
|
| https://age.apache.org/
| whattheheckheck wrote:
| https://explainextended.com/2023/12/31/happy-new-year-15/
| simlevesque wrote:
| I'll try this at work tomorrow!
| ofrzeta wrote:
| Here is the implementation:
| https://github.com/dsg-polymtl/flockmtl
| geekodour wrote:
| This paper, and solutions such as sqlcoder and
| https://doris.apache.org/zh-CN/blog/Tencent-LLM, let you query a
| DB via natural language.
|
| But recently there has been a surge of interest in using MCP to
| query databases, given the number of MCP servers popping up.
| An example:
| https://www.reddit.com/r/ChatGPTCoding/comments/1jd9lfa/lear...
|
| So I was wondering whether things like the Doris blog post, this
| paper and sqlcoder are still relevant. What extra does this
| approach offer vs. trying to build it over MCP?
___________________________________________________________________
(page generated 2025-04-08 23:02 UTC)