[HN Gopher] Building AI agents to query your databases
       ___________________________________________________________________
        
       Building AI agents to query your databases
        
       Author : vortex_ape
       Score  : 150 points
       Date   : 2025-03-14 10:51 UTC (4 days ago)
        
 (HTM) web link (blog.dust.tt)
 (TXT) w3m dump (blog.dust.tt)
        
       | lennythedev wrote:
       | Just messed around with the concept last week. With a good enough
       | schema explanation, bigger reasoning models did an amazing job.
       | Definitely something I'm going to use in my dataflows.
        
       | Tewboo wrote:
       | I've built AI agents for database queries; they streamline
       | processes but require careful design to avoid overloading the
       | system.
        
       | Sayyidalijufri wrote:
       | Cool
        
       | bob1029 wrote:
       | > This abstraction shields users from the complexity of the
       | underlying systems and allows us to add new data sources without
       | changing the user experience.
       | 
       | Cursed mission. These sorts of things do work amazingly well for
        | toy problem domains. But once you get into more complex business
        | logic involving 4-way+ joins, things go sideways fast.
       | 
       | I think it might be possible to have a human in the loop during
        | the SQL authoring phase, but there's no way you can do it
        | cleanly without outside interaction in all cases.
       | 
       | 95% correct might sound amazing at first, but it might as well be
       | 0% in practice. You need to be perfectly correct when working
       | with data in bulk with SQL operations.
        
         | ritz_labringue wrote:
         | It does require writing good instructions for the LLM to
         | properly use the tables, and it works best if you carefully
         | pick the tables that your agent is allowed to use beforehand.
          | We have many users who use it for everyday work with real
          | data (definitely not toy problems).
        
           | iLoveOncall wrote:
           | If only we had a language to accurately describe what we want
           | to retrieve from the database! Alas, one can only dream!
        
           | spolu wrote:
           | Yes you are perfectly right. Our product pushes users to be
           | selective on the tables they give access to a given agent for
           | a given use-case :+1:
           | 
            | The tricky part is correctly supporting multiple systems,
            | each of which has its own specificities. All the way to
           | Salesforce which is an entirely different beast in terms of
           | query language. We're working on it right now and will likely
           | follow-up with a blog post there :+1:
        
             | scottbcovert wrote:
             | Salesforce architect here (from partner firm, not the
             | mothership directly)--Salesforce's query language, SOQL, is
             | definitely a different beast as you say. I'd like to learn
             | more about the issues you're having with the integration,
             | specifically the permissions enforcement. I may be
             | misunderstanding what you meant in the blog post, but if
             | you're passing a SOQL query through the REST API then the
             | results will be scoped by default to the permissions of the
             | user that went through the OAuth flow. My email is in my
             | profile if you're open to connecting.
        
           | troupo wrote:
           | > It does require writing good instructions for the LLM to
           | properly use the tables
           | 
           | --- start quote ---
           | 
           | prompt engineering is nothing but an attempt to reverse-
           | engineer a non-deterministic black box for which any of the
           | parameters below are unknown:
           | 
           | - training set
           | 
           | - weights
           | 
           | - constraints on the model
           | 
           | - layers between you and the model that transform both your
           | input and the model's output that can change at any time
           | 
           | - availability of compute for your specific query
           | 
           | - and definitely some more details I haven't thought of
           | 
           | https://dmitriid.com/prompting-llms-is-not-engineering
           | 
           | --- end quote ---
        
             | aargh_aargh wrote:
             | What else is engineering then if not taming the unknown and
             | the unknowable? How is building a bridge any different? Do
             | you know everything in advance about the composition of
             | terrain, the traffic, the wind and the earthquakes? Or are
             | you making educated guesses about unknown quantities to get
             | something that fits into some parameters that are Good
             | Enough(TM) for the given purpose?
        
               | achierius wrote:
               | > and the unknowable
               | 
               | This is the crux. Sure, for high level software (e.g. Web
               | apps), many parts of the system will feel like black
               | boxes, but low-level software does not generally have
                | this problem. Sure, sometimes you have to deal with a
                | binary blob driver, but more often than not you're in
                | control of, or at least able to debug, almost all of
                | the software running on your system.
               | 
               | > Building a bridge
               | 
                | There should NOT be significant unknowns when you're
                | building a bridge; that's how people die. You turn those
               | parameters into "knowns with high confidence", which is
               | not something you can even begin to do for the LLM
               | parameters described above.
        
         | spolu wrote:
          | Hi, you are right that things can go sideways fast. In
          | practice, though, the data that the typical employee needs is
          | quite simple. So there is definitely a very nice fit for this
          | kind of product, with a large number of use-cases that we see
          | provide a lot of value internally, both for employees (self-
          | serve access to data) and for data scientists (reduced load).
          | 
          | For complex queries/use-cases, we generally instead push our
          | users to create agents that assist them in shaping SQL
          | directly, instead of going directly from text to result/graphs.
          | That pushes them to think more about correctness while still
          | saving them tons of time (the agent has access to the table
          | schemas etc.), but it's not a good fit for non-technical
          | people of course.
        
           | abirch wrote:
            | This works well until it doesn't, as long as there is
            | someone who is responsible for data correctness. E.g. the
            | cardinality of two joined tables: there's currently no one
            | in the system with two locations in the employee_to_location
            | table, so the query works right now. Once that happens, the
            | query will return the wrong employee count.
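            | A minimal sqlite3 sketch of the fan-out problem described
            | above (table and column names hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE employee_to_location (employee_id INTEGER, location TEXT);
INSERT INTO employee VALUES (1, 'Ada'), (2, 'Grace');
-- One location each: the join is 1:1 and the count looks correct.
INSERT INTO employee_to_location VALUES (1, 'Paris'), (2, 'NYC');
""")

count_sql = """
SELECT COUNT(*) FROM employee e
JOIN employee_to_location l ON l.employee_id = e.id
"""
print(con.execute(count_sql).fetchone()[0])  # 2: looks right today

# The moment one employee gains a second location, the same query fans out.
con.execute("INSERT INTO employee_to_location VALUES (1, 'Berlin')")
print(con.execute(count_sql).fetchone()[0])  # 3: wrong employee count

# Counting DISTINCT employees survives the cardinality change.
print(con.execute(
    count_sql.replace("COUNT(*)", "COUNT(DISTINCT e.id)")).fetchone()[0])  # 2
```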
        
         | ozim wrote:
         | A bit of a joke here:
         | 
         | It solves 100% cases where some manager requests dashboard
         | never to look at it again after one day.
        
           | aargh_aargh wrote:
           | There's no smoke without fire.
        
         | mritchie712 wrote:
         | Using a semantic layer is the cleanest way to have a human in
         | the loop. A human can validate and create all important metrics
         | (e.g. what does "monthly active users" really mean) then an LLM
         | can use that metric definition whenever asked for MAU.
         | 
         | With a semantic layer, you get the added benefit of writing
         | queries in JSON instead of raw SQL. LLM's are much more
         | consistent at writing a small JSON vs. hundreds of lines of
         | SQL.
         | 
         | We[0] use cube[1] for this. It's the best open source semantic
         | layer, but there's a couple closed source options too.
         | 
         | 0 - https://www.definite.app/
         | 
         | 1 - https://cube.dev/
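          | A sketch of the idea, with a hypothetical MAU metric and a
          | stdlib validator standing in for the semantic layer (Cube's
          | real query format is similar in spirit, not identical):

```python
import json

# Metrics a human has defined and validated once, in the semantic layer.
ALLOWED_MEASURES = {"users.monthly_active_users", "orders.count"}
ALLOWED_DIMENSIONS = {"users.signup_channel", "orders.created_at"}

def validate_semantic_query(raw: str) -> dict:
    """Accept a small JSON query from the LLM instead of free-form SQL."""
    q = json.loads(raw)
    for m in q.get("measures", []):
        if m not in ALLOWED_MEASURES:
            raise ValueError(f"unknown measure: {m}")
    for d in q.get("dimensions", []):
        if d not in ALLOWED_DIMENSIONS:
            raise ValueError(f"unknown dimension: {d}")
    return q

# What the LLM emits when asked for "MAU by channel":
llm_output = ('{"measures": ["users.monthly_active_users"], '
              '"dimensions": ["users.signup_channel"]}')
query = validate_semantic_query(llm_output)
print(query["measures"])  # ['users.monthly_active_users']
```

Anything outside the human-vetted metric list is rejected before it ever reaches the database.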
        
           | tillvz wrote:
           | Completely agree! A semantic layer is essential for scaling
           | analytics to enterprise complexity.
           | 
           | Another alternative here is Veezoo [0], which combines the
           | semantic layer (a Knowledge Graph) and self-service analytics
           | into one integrated solution.
           | 
           | We built it specifically for the analytics use-case for both
           | the "data persona" to manage the semantic layer, as well as
           | for the "business persona" to analyze the data.
           | 
            | If you're looking for a semantic-layer + (embedded) BI
            | solution right out of the box, this could be a fit.
           | 
           | 0 - https://www.veezoo.com
        
           | efromvt wrote:
           | JSON generation against a semantic layer and validation loops
           | is definitely the easiest way to get high 9s success for
           | going directly to a correct query from text. For the human in
           | the loop cases, going directly to SQL can be fun - I've toyed
           | with a SQL-like semantic layer that removes the need for
           | direct table access and joins, which removes two of the risk
           | points for LLMs going off the rails while still leveraging a
           | lot of the baked in knowledge about SQL syntax (window
           | functions, transformations, etc) that can be hard to
           | exhaustively bake into the semantic layer. (It's annoying
           | when the semantic layer doesn't quite have what you need.)
        
             | mritchie712 wrote:
             | duckdb has a `json_serialize_sql` function that we've been
             | messing with. It could be an interesting middle ground. It
             | lets you write SQL as JSON.
        
           | pbronez wrote:
           | How well is that working for you?
           | 
           | We use a pattern where we ETL things into tables that model
           | the upstream source closely, then use SQL Views to tighten up
           | the model and integrate across data sources where needed.
           | Keeping this all inside one DB allows us to use tools that
           | understand the schema for autocomplete, etc.
           | 
           | I expect the developer experience would be significantly
           | worse if we started writing views in YAML instead of SQL...
           | but you've found the opposite?
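            | The ETL-then-views pattern described above, sketched in
            | sqlite3 (table and view names hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Raw table mirrors the upstream source closely, warts and all.
CREATE TABLE raw_orders (
    id INTEGER, status TEXT, amount_cents INTEGER, _synced_at TEXT);
INSERT INTO raw_orders VALUES
  (1, 'COMPLETE', 1999, '2025-03-14'),
  (2, 'cancelled', 500, '2025-03-14'),
  (3, 'COMPLETE', 2500, '2025-03-15');

-- The view tightens the model: normalized status, dollars, no sync metadata.
CREATE VIEW orders AS
SELECT id,
       LOWER(status)        AS status,
       amount_cents / 100.0 AS amount_usd
FROM raw_orders;
""")

rows = con.execute(
    "SELECT status, SUM(amount_usd) FROM orders "
    "GROUP BY status ORDER BY status").fetchall()
print(rows)
```

Everything stays in one database, so schema-aware tooling (autocomplete, an LLM reading the catalog) sees both layers.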
        
             | mritchie712 wrote:
             | Do you care about self-serve analytics (i.e. allowing
             | people that don't know SQL to explore your data)?
             | 
             | A semantic layer is the best way to enable self-serve
             | analytics, but if you don't care about it, it's probably
             | not worth the hassle.
             | 
             | We also use the semantic layer for other nice things like
             | setting goals, column descriptions and other metadata.
        
           | zerop wrote:
           | Currently exploring cube for a "natural language to SQL"
           | solution.
           | 
           | My schema is - 90+ Tables, 2500+ Columns, well documented
           | 
           | From your experience, does Cube look a fit? My use cases will
           | definitely have JOINS.
        
             | mritchie712 wrote:
             | yes, that shouldn't be a problem.
             | 
             | with that many tables, you might want to use Views:
             | https://cube.dev/docs/reference/data-model/view
        
               | zerop wrote:
                | Thanks. Sorry, one more question: do we need a human in
                | the loop with Cube to define the views for all kinds of
                | queries?
                | 
                | In my use case, it's going to be exposed to various
                | kinds of stakeholders, and there will be a wide variety
                | of user queries. I can't pre-create views/aggregations
                | for all scenarios.
        
         | fsndz wrote:
          | Some people say agents can help, but it's still difficult to
          | pull it off without a robust human in the loop:
          | https://medium.com/thoughts-on-machine-learning/build-a-text...
        
       | mritchie712 wrote:
       | We (https://www.definite.app/) solved this a bit differently.
       | 
       | We spin up a data lake and pipelines (we support 500+
       | integrations / connectors) to populate the data lake for you then
       | put DuckDB on top as a single query engine to access all your
       | data.
        
         | yonl wrote:
         | This is really interesting. At my previous company, I built a
         | data lakehouse for operational reporting with recency
         | prioritization (query only recent data, archive the rest).
         | While there was no LLM integration when I left, I've learned
         | from former colleagues that they've since added a lightweight
          | LLM layer on top (though I suspect Dust's implementation is
         | more comprehensive).
         | 
         | Our main requirement was querying recent operational data
         | across daily/weekly/monthly/quarterly timeframes. The data
         | sources included OLTP binlogs, OLAP views, SFDC, and about 15
         | other marketing platforms. We implemented a datalake with our
         | own query and archival layers. This approach worked well for
         | queries like "conversion rate per channel this quarter" where
         | we needed broad data coverage (all 17 integrations) but
          | manageable depth (a reasonable number of rows scanned).
         | 
         | This architecture also enabled quick solutions for additional
         | use cases, like on-the-fly SFDC data enrichment that our
         | analytics team could handle independently. Later, I learned the
         | team integrated LLMs as they began dumping OLAP views inside
         | the datalake for different query types, and eventually replaced
         | our original query layer with DuckDB.
         | 
          | I believe approaches like these (what I had done as an in-
          | house solution, and what Definite may be doing more
          | extensively) are data- and query-pattern-focused first. While
          | it might initially seem like overkill, this approach can
          | withstand organizational complexity challenges, with LLMs
          | serving primarily as an interpretation layer. From skimming
          | the Dust blog, their approach is refreshing, though it seems
          | their product was built primarily for LLM integration rather
          | than focusing first on data management and scale. They likely
          | have internal mechanisms to handle various use cases that
          | weren't detailed in the blog.
        
         | nicholaschiang wrote:
         | Do you host the data lake yourself? How do you handle data
         | privacy?
         | 
         | Some large enterprises won't allow their data to leave e.g.
         | their own Snowflake warehouse. All tooling they use has to run
         | on top of their warehouse.
        
       | gavinray wrote:
        | Interesting -- I work on a similar tool [1], and the JSON IR
        | for your query is similar to the internal IR we used for our
        | data connectors.
       | 
       | [1] https://promptql.hasura.io
        
       | LaGrange wrote:
       | I mean this is generally repulsive, but please I beg of you, run
       | this exclusively against a read only replica. I mean, you should
       | have one for exploratory queries _anyway_, but nobody ever does
       | that.
       | 
       | "Validating the query to ensure it's safe and well-formed" all I
       | can say to that is "ROFL. LMAO."
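        | One cheap belt-and-suspenders version of this, in stdlib
        | sqlite3 (for Postgres the equivalent would be a replica DSN
        | plus a read-only role or default_transaction_read_only):

```python
import os
import sqlite3
import tempfile

# Create a database on disk, then hand the agent a read-only handle.
path = os.path.join(tempfile.mkdtemp(), "app.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE users (id INTEGER, email TEXT)")
rw.execute("INSERT INTO users VALUES (1, 'a@example.com')")
rw.commit()

# mode=ro refuses writes at the connection level, whatever SQL arrives.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
print(ro.execute("SELECT COUNT(*) FROM users").fetchone()[0])  # 1

try:
    ro.execute("DROP TABLE users")  # the kind of query "validation" misses
except sqlite3.OperationalError as e:
    print("refused:", e)
```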
        
         | RajT88 wrote:
         | Thank you.
         | 
         | "AI Bot, summarize the number of logins this week broken down
         | by email address and password hash".
        
         | cdblades wrote:
         | Yep. MCP is a project run by Anthropic
         | (https://github.com/modelcontextprotocol) that "enables
         | seamless integration between LLM applications and external data
         | sources and tools."
         | 
         | They host a repo with lots of what they call "reference"
         | implementations, including this one for postgres which naively
         | takes a query from a request and shoves it at your database
         | with no validation, escaping, anything:
         | https://github.com/modelcontextprotocol/servers/blob/7d6cdb6...
         | 
         | There's an issue calling this out, and it's been labeled a bug,
         | but still.
         | 
         | When you go to their documentation, the only example of
         | actually building a server is this section where they
         | just...use an LLM to build it. Here's their prompt:
         | 
         | > Build an MCP server that: - Connects to my company's
         | PostgreSQL database - Exposes table schemas as resources -
         | Provides tools for running read-only SQL queries - Includes
         | prompts for common data analysis tasks
         | 
         | So, I think there's a fairly high chance that all of the
         | "reference implementations" they're hosting are AI generated
         | with basically no security considerations or code reviews made.
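          | For contrast, here is a sketch of the kind of enforcement the
          | reference server skips, using sqlite3's authorizer hook (the
          | Postgres analogue would be a read-only role; names here are
          | illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE accounts (id INTEGER, balance INTEGER)")
con.execute("INSERT INTO accounts VALUES (1, 100)")

# Allow only read actions, no matter how the incoming SQL is phrased.
READ_OPS = {sqlite3.SQLITE_SELECT, sqlite3.SQLITE_READ}

def read_only_authorizer(action, *args):
    return sqlite3.SQLITE_OK if action in READ_OPS else sqlite3.SQLITE_DENY

con.set_authorizer(read_only_authorizer)
print(con.execute("SELECT balance FROM accounts").fetchone()[0])  # 100

try:
    con.execute("UPDATE accounts SET balance = 0")
except sqlite3.DatabaseError as e:
    print("blocked:", e)
```

Enforcing this at the database layer beats string-inspecting the query, which is trivially bypassed.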
        
       | gcanyon wrote:
       | My general method for things like this is to:
       | 
        | 1. Get a .dot file of the database. Many tools will export
        | this.
        | 
        | 2. Open the .dot in a tool I built for the purpose.
        | 
        | 3. Select the tables I'm interested in, and export a subset of
        | the .dot file representing just those tables and relationships.
        | 
        | 4. Hand that subset .dot file to the LLM and say, "given this
        | schema, write a query -- here's what I want: <rest of the
        | request here>"
       | 
       | That gets the job done 60% of the time. Sometimes when there's an
       | incorrect shortcut relationship resulting in the wrong join I'll
       | have to redirect with something like, "You need to go through
       | <table list> to relate <table X> to <table Y>" That gets my
       | success rate up above 95%. I'm not doing ridiculous queries, but
       | I am doing recursive aggregations successfully.
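        | Step 3 -- carving a table subset out of the .dot file -- can be
        | sketched with a stdlib regex (assuming the common `a -> b` edge
        | syntax; real exports vary):

```python
import re

dot = """digraph erd {
  users; orders; order_items; invoices; audit_log;
  users -> orders;
  orders -> order_items;
  orders -> invoices;
  users -> audit_log;
}"""

def subset_dot(dot_src: str, keep: set) -> str:
    """Keep only the chosen tables and the edges between them."""
    edges = re.findall(r"(\w+)\s*->\s*(\w+)", dot_src)
    kept = [(a, b) for a, b in edges if a in keep and b in keep]
    body = "\n".join(f"  {a} -> {b};" for a, b in kept)
    return "digraph subset {\n" + body + "\n}"

# The trimmed schema goes into the prompt alongside the actual request.
prompt_schema = subset_dot(dot, {"users", "orders", "order_items"})
print(prompt_schema)
```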
        
         | totalhack wrote:
         | You are doing a lot of the work a semantic layer would do for
         | you. I wonder if you would have better luck having the LLM talk
         | to a semantic layer instead of directly to the database.
        
           | jt_b wrote:
           | Can you talk more about what an implementation of such a
           | semantic layer would look like?
        
             | rishabhparikh wrote:
             | Not OP, but a naive guess is it would mean that you'd have
             | your schema defined in an ORM (for example, Prisma). The
             | advantage here is that the LLM gets context on both the
             | schema and how the schema is used throughout the
             | application.
        
             | totalhack wrote:
             | There are a number of semantic layer tools out there these
             | days. Each has their own unique approach, but essentially
             | it's a meta layer on top of your database that can be used
             | to do things like form queries or provide a consolidated
             | API to your data (which may be in multiple databases).
             | 
             | Some comments on this thread mention popular semantic layer
             | tools like cube.dev. I also made an open source one that I
             | use regularly, though it's currently in I-hope-to-put-more-
             | time-into-this-someday mode. Been busy with an acquisition
             | this year.
             | 
             | https://github.com/totalhack/zillion
        
             | MattDaEskimo wrote:
             | I use Weaviate and let the model create GraphQL queries to
             | take advantage of both the semantic and data layer. Not
             | sure how efficient it is but it's worked for me
        
         | breadwinner wrote:
         | Something like this is what I do also. Is there another way?
         | How are others doing it?
        
           | dimitri-vs wrote:
            | Export the DDL of one or more relevant tables, then run a
            | query to get a random sample of 5 records from each table.
            | It's really quick to gather, and it's enough context to
            | handle most query-writing tasks with some guidance.
        
             | sitkack wrote:
             | Schema + samples. Same thing a skilled person would use.
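              | That context-gathering step, sketched against sqlite3 (for
              | Postgres you'd read information_schema or use pg_dump
              | --schema-only instead; table names hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, tier TEXT);
INSERT INTO customers VALUES (1, 'Acme', 'gold'), (2, 'Initech', 'free');
""")

def llm_context(con, sample_rows=5):
    """DDL plus a few sample rows per table: what a skilled person reads first."""
    parts = []
    tables = con.execute(
        "SELECT name, sql FROM sqlite_master WHERE type = 'table'").fetchall()
    for name, ddl in tables:
        rows = con.execute(
            f"SELECT * FROM {name} ORDER BY RANDOM() LIMIT {sample_rows}"
        ).fetchall()
        parts.append(f"{ddl};\n-- sample rows: {rows}")
    return "\n\n".join(parts)

print(llm_context(con))
```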
        
         | Der_Einzige wrote:
         | Combining this with structured/constrained generation with
         | grammars/pydantic can supercharge this btw
        
           | gcanyon wrote:
           | Can you illustrate this a little? Or should I be asking an
           | LLM for advice? :-)
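            | The combination looks roughly like this in spirit (a stdlib
            | dataclass standing in for a pydantic model; real constrained-
            | generation libraries enforce the schema at decode time
            | rather than validating afterwards, and all names here are
            | hypothetical):

```python
import json
from dataclasses import dataclass, fields

@dataclass
class QuerySpec:
    table: str
    columns: list
    limit: int

def parse_llm_json(raw: str) -> QuerySpec:
    """Reject any LLM output that doesn't match the expected shape."""
    data = json.loads(raw)
    expected = {f.name for f in fields(QuerySpec)}
    if set(data) != expected:
        raise ValueError(f"expected keys {expected}, got {set(data)}")
    spec = QuerySpec(**data)
    if not isinstance(spec.limit, int) or spec.limit <= 0:
        raise ValueError("limit must be a positive integer")
    return spec

spec = parse_llm_json(
    '{"table": "orders", "columns": ["id", "total"], "limit": 10}')
print(f"SELECT {', '.join(spec.columns)} FROM {spec.table} LIMIT {spec.limit}")
# SELECT id, total FROM orders LIMIT 10
```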
        
       | gergely wrote:
        | How do you solve the following problems?
        | 
        | 1. How do you ensure that running the same natural language
        | question always gives the same result?
        | 
        | 2. How do you teach an agent/LLM the data model of the
        | application that stores the data in the databases?
        
         | zurfer wrote:
         | 1. By having one obvious documented way of answering the
         | questions or accepting that humans would also come to different
         | results.
         | 
          | 2. Investing in good data models and documenting the edge
          | cases. Also, learning from explicit feedback is powerful.
         | 
         | This helps our customers at getdot.ai get highly reliable
         | results.
         | 
          | The models have also gotten a lot better at making sense of
          | weird and complex data models in recent months.
        
       | mattpope wrote:
        | I feel like this misses the most useful technique for this (or
        | really glosses over any techniques at all): flat denormalized
        | schemas. If you have an ENUM column, much like a human the LLM
        | won't have enough context to query it. And much like a human,
        | it finds 4-way JOINs tricky. Having one large sparse matrix-
        | like table with many boolean columns like `is_sale_complete`
        | and `is_sale_cancelled`, data-warehouse style, is straight up
        | the key to effectively using an LLM to generate SQL.
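        | A sketch of that wide-table style in sqlite3, using the boolean
        | columns named above (data and other names hypothetical):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- One wide, denormalized fact table: no JOINs, no opaque ENUM codes.
CREATE TABLE sales_flat (
    sale_id           INTEGER,
    region            TEXT,
    amount_usd        REAL,
    is_sale_complete  BOOLEAN,
    is_sale_cancelled BOOLEAN
);
INSERT INTO sales_flat VALUES
  (1, 'EMEA', 100.0, 1, 0),
  (2, 'EMEA',  40.0, 0, 1),
  (3, 'AMER', 250.0, 1, 0);
""")

# The kind of single-table query an LLM (or a human) rarely gets wrong:
sql = """
SELECT region, SUM(amount_usd)
FROM sales_flat
WHERE is_sale_complete AND NOT is_sale_cancelled
GROUP BY region ORDER BY region
"""
print(con.execute(sql).fetchall())  # [('AMER', 250.0), ('EMEA', 100.0)]
```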
        
       | ilaksh wrote:
       | On the off chance anyone is interested, I just built a Supabase
       | plugin for MindRoot: https://github.com/runvnc/mr_supabase
       | 
       | MindRoot: https://github.com/runvnc/mindroot
       | 
       | Demo of agent using it to update a database after doing research:
       | https://www.youtube.com/watch?v=nXwwSj0KSXI
       | 
       | If you give it the actual database password it will use a
       | Postgres client to get the actual information_schema and put it
       | in the system message. It uses the normal Supabase library for
       | everything else.
        
       | runekaagaard wrote:
       | I use this every day: https://github.com/runekaagaard/mcp-alchemy
        
         | compacct27 wrote:
         | I'm sorry, does this send your database to Claude?
        
           | runekaagaard wrote:
           | Yeah, Claude can query and read the result sets. It does not
           | send the entire database.
        
       | monkeydust wrote:
       | Anyone else wishing their town halls were this sexy?
        
       ___________________________________________________________________
       (page generated 2025-03-18 23:01 UTC)