[HN Gopher] Building AI agents to query your databases
___________________________________________________________________
Building AI agents to query your databases
Author : vortex_ape
Score : 150 points
Date : 2025-03-14 10:51 UTC (4 days ago)
(HTM) web link (blog.dust.tt)
(TXT) w3m dump (blog.dust.tt)
| lennythedev wrote:
| Just messed around with the concept last week. With a good enough
| schema explanation, bigger reasoning models did an amazing job.
| Definitely something I'm going to use in my dataflows.
| Tewboo wrote:
| I've built AI agents for database queries; they streamline
| processes but require careful design to avoid overloading the
| system.
| Sayyidalijufri wrote:
| Cool
| bob1029 wrote:
| > This abstraction shields users from the complexity of the
| underlying systems and allows us to add new data sources without
| changing the user experience.
|
 | Cursed mission. These sorts of things do work amazingly well for
 | toy problem domains. But once you get into more complex business
 | logic involving 4-way+ joins, things go sideways fast.
 |
 | I think it might be possible to have a human in the loop during
 | the SQL authoring phase, but there's no way to do it cleanly
 | without outside interaction in all cases.
|
| 95% correct might sound amazing at first, but it might as well be
| 0% in practice. You need to be perfectly correct when working
| with data in bulk with SQL operations.
| ritz_labringue wrote:
| It does require writing good instructions for the LLM to
| properly use the tables, and it works best if you carefully
| pick the tables that your agent is allowed to use beforehand.
 | We have many users who use it for everyday work with real
 | data (definitely not toy problems).
| iLoveOncall wrote:
| If only we had a language to accurately describe what we want
| to retrieve from the database! Alas, one can only dream!
| spolu wrote:
 | Yes, you are perfectly right. Our product pushes users to be
 | selective about the tables they give a given agent access to
 | for a given use-case :+1:
|
 | The tricky part is correctly supporting multiple systems,
 | each with its own specificities. All the way to Salesforce,
 | which is an entirely different beast in terms of query
 | language. We're working on it right now and will likely
 | follow up with a blog post :+1:
| scottbcovert wrote:
| Salesforce architect here (from partner firm, not the
| mothership directly)--Salesforce's query language, SOQL, is
| definitely a different beast as you say. I'd like to learn
| more about the issues you're having with the integration,
| specifically the permissions enforcement. I may be
| misunderstanding what you meant in the blog post, but if
| you're passing a SOQL query through the REST API then the
| results will be scoped by default to the permissions of the
| user that went through the OAuth flow. My email is in my
| profile if you're open to connecting.
| troupo wrote:
| > It does require writing good instructions for the LLM to
| properly use the tables
|
| --- start quote ---
|
| prompt engineering is nothing but an attempt to reverse-
| engineer a non-deterministic black box for which any of the
| parameters below are unknown:
|
| - training set
|
| - weights
|
| - constraints on the model
|
| - layers between you and the model that transform both your
| input and the model's output that can change at any time
|
| - availability of compute for your specific query
|
| - and definitely some more details I haven't thought of
|
| https://dmitriid.com/prompting-llms-is-not-engineering
|
| --- end quote ---
| aargh_aargh wrote:
| What else is engineering then if not taming the unknown and
| the unknowable? How is building a bridge any different? Do
| you know everything in advance about the composition of
| terrain, the traffic, the wind and the earthquakes? Or are
| you making educated guesses about unknown quantities to get
| something that fits into some parameters that are Good
| Enough(TM) for the given purpose?
| achierius wrote:
| > and the unknowable
|
| This is the crux. Sure, for high level software (e.g. Web
| apps), many parts of the system will feel like black
| boxes, but low-level software does not generally have
| this problem. Sure, sometimes you have to deal with a
 | binary blob driver, but more often than not you're in
 | control of, or at least able to debug, almost all of the
 | software running on your system.
|
| > Building a bridge
|
| There should NOT be significant unknowns when you're
| building a bridge, this is how people die. You turn those
| parameters into "knowns with high confidence", which is
| not something you can even begin to do for the LLM
| parameters described above.
| spolu wrote:
 | Hi, you are right that things can go sideways fast. In
 | practice, the data that the typical employee needs is also
 | quite simple. So there is definitely a very nice fit for this
 | kind of product across a large number of use-cases, which we
 | do see provide a lot of value internally, both for employees
 | (self-service access to data) and for data scientists
 | (reduced load).
 |
 | For complex queries/use-cases, we generally instead push our
 | users to create agents that assist them in shaping SQL
 | directly, instead of going directly from text to
 | results/graphs. That pushes them to think more about
 | correctness while still saving them tons of time (the agent
 | has access to the table schemas etc...), but it's not a good
 | fit for non-technical people, of course.
| abirch wrote:
 | This works well until it doesn't, as long as there is someone
 | who is responsible for data correctness. E.g. the cardinality
 | between two joined tables has to actually be maintained,
 | rather than: there's currently no one in the system with two
 | locations in the employee_to_location table, so it works
 | right now. Once that happens, this query will return the
 | wrong employee count.
| ozim wrote:
| A bit of a joke here:
|
| It solves 100% cases where some manager requests dashboard
| never to look at it again after one day.
| aargh_aargh wrote:
| There's no smoke without fire.
| mritchie712 wrote:
| Using a semantic layer is the cleanest way to have a human in
| the loop. A human can validate and create all important metrics
| (e.g. what does "monthly active users" really mean) then an LLM
| can use that metric definition whenever asked for MAU.
|
 | With a semantic layer, you get the added benefit of writing
 | queries in JSON instead of raw SQL. LLMs are much more
 | consistent at writing a small JSON object than hundreds of
 | lines of SQL.
|
 | We[0] use Cube[1] for this. It's the best open-source
 | semantic layer, but there are a couple of closed-source
 | options too.
|
| 0 - https://www.definite.app/
|
| 1 - https://cube.dev/
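A minimal sketch of the idea, with invented metric/dimension names (this is not Cube's actual API): the human-vetted SQL lives in the semantic layer, and the LLM only has to emit a small JSON object.

```python
# Hypothetical semantic layer: humans validate each metric definition once;
# the LLM never writes the SQL itself, only a tiny JSON query against it.
METRICS = {
    "monthly_active_users": "COUNT(DISTINCT user_id)",
}
DIMENSIONS = {
    "month": "strftime('%Y-%m', event_time)",
}

def compile_query(q: dict) -> str:
    """Turn a small JSON query into SQL using the vetted definitions."""
    metric_sql = METRICS[q["metric"]]        # KeyError = unknown metric
    dim_sql = DIMENSIONS[q["dimension"]]     # KeyError = unknown dimension
    return (f"SELECT {dim_sql} AS {q['dimension']}, "
            f"{metric_sql} AS {q['metric']} "
            f"FROM events GROUP BY 1")

# What an LLM would emit instead of hundreds of lines of raw SQL:
llm_output = {"metric": "monthly_active_users", "dimension": "month"}
sql = compile_query(llm_output)
print(sql)
```

Because the LLM can only pick from predefined names, an invalid query fails loudly at compile time instead of silently returning wrong numbers.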
| tillvz wrote:
| Completely agree! A semantic layer is essential for scaling
| analytics to enterprise complexity.
|
| Another alternative here is Veezoo [0], which combines the
| semantic layer (a Knowledge Graph) and self-service analytics
| into one integrated solution.
|
| We built it specifically for the analytics use-case for both
| the "data persona" to manage the semantic layer, as well as
| for the "business persona" to analyze the data.
|
 | If you're looking for a semantic-layer + (embedded) BI
 | solution right out of the box, this could be a fit.
|
| 0 - https://www.veezoo.com
| efromvt wrote:
| JSON generation against a semantic layer and validation loops
| is definitely the easiest way to get high 9s success for
| going directly to a correct query from text. For the human in
| the loop cases, going directly to SQL can be fun - I've toyed
| with a SQL-like semantic layer that removes the need for
| direct table access and joins, which removes two of the risk
| points for LLMs going off the rails while still leveraging a
| lot of the baked in knowledge about SQL syntax (window
| functions, transformations, etc) that can be hard to
| exhaustively bake into the semantic layer. (It's annoying
| when the semantic layer doesn't quite have what you need.)
| mritchie712 wrote:
| duckdb has a `json_serialize_sql` function that we've been
| messing with. It could be an interesting middle ground. It
| lets you write SQL as JSON.
| pbronez wrote:
| How well is that working for you?
|
| We use a pattern where we ETL things into tables that model
| the upstream source closely, then use SQL Views to tighten up
| the model and integrate across data sources where needed.
| Keeping this all inside one DB allows us to use tools that
| understand the schema for autocomplete, etc.
|
| I expect the developer experience would be significantly
| worse if we started writing views in YAML instead of SQL...
| but you've found the opposite?
| mritchie712 wrote:
| Do you care about self-serve analytics (i.e. allowing
| people that don't know SQL to explore your data)?
|
| A semantic layer is the best way to enable self-serve
| analytics, but if you don't care about it, it's probably
| not worth the hassle.
|
| We also use the semantic layer for other nice things like
| setting goals, column descriptions and other metadata.
| zerop wrote:
| Currently exploring cube for a "natural language to SQL"
| solution.
|
| My schema is - 90+ Tables, 2500+ Columns, well documented
|
| From your experience, does Cube look a fit? My use cases will
| definitely have JOINS.
| mritchie712 wrote:
| yes, that shouldn't be a problem.
|
| with that many tables, you might want to use Views:
| https://cube.dev/docs/reference/data-model/view
| zerop wrote:
 | Thanks. Sorry, asking one more question: do we need a human
 | in the loop with Cube to define the views for all kinds of
 | queries?
 |
 | In my use case, it's going to be exposed to various kinds
 | of stakeholders, and user queries will be quite varied. I
 | can't pre-create views/aggregations for all scenarios.
| fsndz wrote:
 | some people say agents can help, but it's still difficult to
 | pull off without a robust human in the loop:
 | https://medium.com/thoughts-
 | on-machine-learning/build-a-text...
| mritchie712 wrote:
| We (https://www.definite.app/) solved this a bit differently.
|
| We spin up a data lake and pipelines (we support 500+
| integrations / connectors) to populate the data lake for you then
| put DuckDB on top as a single query engine to access all your
| data.
| yonl wrote:
| This is really interesting. At my previous company, I built a
| data lakehouse for operational reporting with recency
| prioritization (query only recent data, archive the rest).
| While there was no LLM integration when I left, I've learned
| from former colleagues that they've since added a lightweight
 | LLM layer on top (though I suspect Dust's implementation is
 | more comprehensive).
|
| Our main requirement was querying recent operational data
| across daily/weekly/monthly/quarterly timeframes. The data
| sources included OLTP binlogs, OLAP views, SFDC, and about 15
| other marketing platforms. We implemented a datalake with our
| own query and archival layers. This approach worked well for
| queries like "conversion rate per channel this quarter" where
| we needed broad data coverage (all 17 integrations) but
| manageable depth (reasonable row scanned).
|
| This architecture also enabled quick solutions for additional
| use cases, like on-the-fly SFDC data enrichment that our
| analytics team could handle independently. Later, I learned the
| team integrated LLMs as they began dumping OLAP views inside
| the datalake for different query types, and eventually replaced
| our original query layer with DuckDB.
|
 | I believe approaches like these (what I built as an in-house
 | solution and what Definite may be doing more extensively) put
 | data and query patterns first. While it might initially
| seem like overkill, this approach can withstand organizational
| complexity challenges - with LLMs serving primarily as an
 | interpretation layer. From skimming the Dust blog, their
| approach is refreshing, though it seems their product was built
| primarily for LLM integration rather than focusing first on
| data management and scale. They likely have internal mechanisms
| to handle various use cases that weren't detailed in the blog.
| nicholaschiang wrote:
| Do you host the data lake yourself? How do you handle data
| privacy?
|
| Some large enterprises won't allow their data to leave e.g.
| their own Snowflake warehouse. All tooling they use has to run
| on top of their warehouse.
| gavinray wrote:
| Interesting -- I work on a similar tool [1], and the JSON IR
| representation for your query is similar to the internal IR we
| used for our data connectors.
|
| [1] https://promptql.hasura.io
| LaGrange wrote:
 | I mean, this is generally repulsive, but please, I beg of
 | you, run this exclusively against a read-only replica. You
 | should have one for exploratory queries _anyway_, but nobody
 | ever does that.
|
| "Validating the query to ensure it's safe and well-formed" all I
| can say to that is "ROFL. LMAO."
| RajT88 wrote:
| Thank you.
|
| "AI Bot, summarize the number of logins this week broken down
| by email address and password hash".
| cdblades wrote:
| Yep. MCP is a project run by Anthropic
| (https://github.com/modelcontextprotocol) that "enables
| seamless integration between LLM applications and external data
| sources and tools."
|
| They host a repo with lots of what they call "reference"
| implementations, including this one for postgres which naively
| takes a query from a request and shoves it at your database
| with no validation, escaping, anything:
| https://github.com/modelcontextprotocol/servers/blob/7d6cdb6...
|
| There's an issue calling this out, and it's been labeled a bug,
| but still.
|
| When you go to their documentation, the only example of
| actually building a server is this section where they
| just...use an LLM to build it. Here's their prompt:
|
| > Build an MCP server that: - Connects to my company's
| PostgreSQL database - Exposes table schemas as resources -
| Provides tools for running read-only SQL queries - Includes
| prompts for common data analysis tasks
|
| So, I think there's a fairly high chance that all of the
| "reference implementations" they're hosting are AI generated
| with basically no security considerations or code reviews made.
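For contrast, a sketch of the kind of minimal guard the reference server skips, using sqlite3's authorizer hook to deny anything that isn't a read (a Postgres equivalent would be a SELECT-only role or a read-only transaction):

```python
import sqlite3

# Allowlist of authorizer actions that a pure read needs.
ALLOWED = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ, sqlite3.SQLITE_FUNCTION}

def read_only_authorizer(action, arg1, arg2, db_name, trigger):
    # Called by SQLite while compiling each statement; DENY aborts it.
    return sqlite3.SQLITE_OK if action in ALLOWED else sqlite3.SQLITE_DENY

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (email TEXT)")  # set up before locking down
conn.set_authorizer(read_only_authorizer)

n = conn.execute("SELECT COUNT(*) FROM logins").fetchone()[0]  # allowed
try:
    conn.execute("DELETE FROM logins")  # denied at prepare time
except sqlite3.DatabaseError as e:
    print("rejected:", e)
```

The key point is that the guard lives below the LLM, so "validating the query" no longer depends on parsing model output correctly.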
| gcanyon wrote:
| My general method for things like this is to:
|
 1. Get a .dot file of the database. Many tools will export
 this.
 2. Open the .dot in a tool I built for the purpose.
 3. Select the tables I'm interested in, and export a subset of
 the .dot file representing just those tables and
 relationships.
 4. Hand that subset .dot file to the LLM and say, "given this
 schema, write a query -- here's what I want: <rest of the
 request here>"
|
| That gets the job done 60% of the time. Sometimes when there's an
| incorrect shortcut relationship resulting in the wrong join I'll
| have to redirect with something like, "You need to go through
| <table list> to relate <table X> to <table Y>" That gets my
| success rate up above 95%. I'm not doing ridiculous queries, but
| I am doing recursive aggregations successfully.
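A rough sketch of steps 3-4, assuming a toy .dot schema and a hypothetical subset helper (the commenter's actual tool likely does much more):

```python
import re

# Toy Graphviz export of a schema: nodes are tables, edges are FK links.
dot = """
digraph schema {
  orders; customers; products; audit_log;
  orders -> customers; orders -> products; audit_log -> orders;
}
"""

def subset(dot_text: str, keep: set) -> str:
    """Keep only the tables in `keep` and the edges among them."""
    lines = [f"  {node};" for node in sorted(keep)]
    for a, b in re.findall(r"(\w+)\s*->\s*(\w+)", dot_text):
        if a in keep and b in keep:
            lines.append(f"  {a} -> {b};")
    return "digraph schema {\n" + "\n".join(lines) + "\n}"

sub = subset(dot, {"orders", "customers", "products"})
prompt = (f"Given this schema:\n{sub}\n"
          "Write a query -- here's what I want: total per customer.")
print(sub)
```

Trimming the schema to the relevant tables keeps irrelevant relationships (like `audit_log` here) from tempting the model into a wrong join.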
| totalhack wrote:
| You are doing a lot of the work a semantic layer would do for
| you. I wonder if you would have better luck having the LLM talk
| to a semantic layer instead of directly to the database.
| jt_b wrote:
| Can you talk more about what an implementation of such a
| semantic layer would look like?
| rishabhparikh wrote:
| Not OP, but a naive guess is it would mean that you'd have
| your schema defined in an ORM (for example, Prisma). The
| advantage here is that the LLM gets context on both the
| schema and how the schema is used throughout the
| application.
| totalhack wrote:
| There are a number of semantic layer tools out there these
| days. Each has their own unique approach, but essentially
| it's a meta layer on top of your database that can be used
| to do things like form queries or provide a consolidated
| API to your data (which may be in multiple databases).
|
| Some comments on this thread mention popular semantic layer
| tools like cube.dev. I also made an open source one that I
| use regularly, though it's currently in I-hope-to-put-more-
| time-into-this-someday mode. Been busy with an acquisition
| this year.
|
| https://github.com/totalhack/zillion
| MattDaEskimo wrote:
| I use Weaviate and let the model create GraphQL queries to
| take advantage of both the semantic and data layer. Not
| sure how efficient it is but it's worked for me
| breadwinner wrote:
| Something like this is what I do also. Is there another way?
| How are others doing it?
| dimitri-vs wrote:
| Export DDL of one or more relevant tables, run query to get a
| random sample of 5 records for each table. Really quick to
| gather all this and enough context to handle most query
| writing tasks with some guidance.
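A sketch of that context-gathering step with sqlite3 (for Postgres you'd read the catalog tables and use `TABLESAMPLE` instead; the table and data here are invented):

```python
import sqlite3

# Stand-in database with one table and a few rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales (region, amount) VALUES (?, ?)",
                 [("emea", 10.0), ("apac", 20.0), ("amer", 30.0)])

def table_context(conn, table: str) -> str:
    """DDL plus up to 5 random sample rows, for pasting into a prompt."""
    ddl = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type='table' AND name=?",
        (table,),
    ).fetchone()[0]
    rows = conn.execute(
        # table name comes from our own schema listing, not user input
        f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT 5"
    ).fetchall()
    return ddl + "\n-- sample rows:\n" + "\n".join(map(str, rows))

context = table_context(conn, "sales")
print(context)
```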
| sitkack wrote:
| Schema + samples. Same thing a skilled person would use.
| Der_Einzige wrote:
| Combining this with structured/constrained generation with
| grammars/pydantic can supercharge this btw
| gcanyon wrote:
| Can you illustrate this a little? Or should I be asking an
| LLM for advice? :-)
| gergely wrote:
 | How do you solve the following problems? 1. How do you
 | ensure that the same natural language question always yields
 | the same result? 2. How do you teach an agent/LLM the data
 | model of the application that stores the data in the
 | databases?
| zurfer wrote:
| 1. By having one obvious documented way of answering the
| questions or accepting that humans would also come to different
| results.
|
 | 2. Investing in good data models and documenting the edge
 | cases. Also, learning from explicit feedback is powerful.
|
| This helps our customers at getdot.ai get highly reliable
| results.
|
 | The models have also gotten a lot better at making sense of
 | weird and complex data models in recent months.
| mattpope wrote:
 | I feel like this misses the most useful technique for this (or
 | really glossed over any techniques at all): flat denormalized
 | schemas. If you have an ENUM column, much like a human the LLM
 | won't have enough context to query it. And much like a human,
 | 4-way JOINs get tricky. Having one large sparse matrix-like
 | table with many boolean columns like `is_sale_complete` and
 | `is_sale_cancelled`, data warehouse style, is straight up the
 | key to effectively using an LLM to generate SQL.
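A sketch of that shape, assuming a made-up sales table: no joins, and the flag columns carry their meaning in their names:

```python
import sqlite3

# One wide, denormalized table instead of a normalized star of lookups.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sales_flat (
    sale_id            INTEGER PRIMARY KEY,
    customer_name      TEXT,
    is_sale_complete   INTEGER,   -- 0/1 flag instead of a status ENUM
    is_sale_cancelled  INTEGER
)""")
conn.executemany(
    "INSERT INTO sales_flat VALUES (?, ?, ?, ?)",
    [(1, "acme", 1, 0), (2, "initech", 0, 1), (3, "hooli", 1, 0)],
)

# The kind of query an LLM can reliably generate against this shape:
# no joins, and the column name alone says what the filter means.
completed = conn.execute(
    "SELECT COUNT(*) FROM sales_flat WHERE is_sale_complete = 1"
).fetchone()[0]
print(completed)  # 2
```

The trade-off is storage and update complexity, which is why this is a warehouse-side pattern rather than something to do to an OLTP schema.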
| ilaksh wrote:
| On the off chance anyone is interested, I just built a Supabase
| plugin for MindRoot: https://github.com/runvnc/mr_supabase
|
| MindRoot: https://github.com/runvnc/mindroot
|
| Demo of agent using it to update a database after doing research:
| https://www.youtube.com/watch?v=nXwwSj0KSXI
|
| If you give it the actual database password it will use a
| Postgres client to get the actual information_schema and put it
| in the system message. It uses the normal Supabase library for
| everything else.
| runekaagaard wrote:
| I use this every day: https://github.com/runekaagaard/mcp-alchemy
| compacct27 wrote:
| I'm sorry, does this send your database to Claude?
| runekaagaard wrote:
| Yeah, Claude can query and read the result sets. It does not
| send the entire database.
| monkeydust wrote:
| Anyone else wishing their town halls were this sexy?
___________________________________________________________________
(page generated 2025-03-18 23:01 UTC)