[HN Gopher] Show HN: We open sourced our entire text-to-SQL product
___________________________________________________________________
Show HN: We open sourced our entire text-to-SQL product
Long story short: We (Dataherald) just open-sourced our entire
codebase, including the core engine, the clients that interact with
it and the backend application layer for authentication and RBAC.
You can now use the full solution to build text-to-SQL into your
product. The Problem: modern LLMs write syntactically correct SQL,
but they struggle with real-world relational data. This is because
real world data and schema is messy, natural language can often be
ambiguous and LLMs are not trained on your specific dataset.
Solution: The core NL-to-SQL engine in Dataherald is an LLM based
agent which uses Chain of Thought (CoT) reasoning and a number of
different tools to generate high accuracy SQL from a given user
prompt. The engine achieves this by: - Collecting context at
configuration from the database and sources such as data
dictionaries and unstructured documents which are stored in a data
store or a vector DB and injected if relevant - Allowing users to
upload sample NL <> SQL pairs (golden SQL) which can be used in few
shot prompting or to fine-tune an NL-to-SQL LLM for that specific
dataset - Executing the SQL against the DB to get a few sample
rows and recover from errors - Using an evaluator to assign a
confidence score to the generated SQL The repo includes four
services
https://github.com/Dataherald/dataherald/tree/main/services: 1-
Engine: The core service which includes the LLM agent, vector
stores and DB connectors. 2- Admin Console: a NextJS front-end for
configuring the engine and observability. 3- Enterprise Backend:
Wraps the core engine, adding authentication, caching, and APIs for
the frontend. 4- Slackbot: Integrate Dataherald directly into your
Slack workflow for on-the-fly data exploration. Would love to hear
from the community on building natural language interfaces to
relational data. Anyone live in production without a human in the
loop? Thoughts on how to improve performance without spending weeks
on model training?
Author : aazo11
Score : 130 points
Date : 2024-05-23 15:50 UTC (1 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| instabart wrote:
| Interesting! I am assuming it can do complex joins. Are there any
| examples of text -> sql it produces? I looked on the website but
| only saw "coming soon"
| duegreg wrote:
| It is chatgpt
| aazo11 wrote:
| Yes. There is a schema linking step which identifies relevant
| columns and tables including any foreign key relationships if
| they exist.
|
| The agent also can be finetuned on sample NL <> SQL pairs or
| they can be used in few shot prompting.
| coder543 wrote:
| Have you considered enforcing a grammar on the LLM when it is
| generating SQL? This could ensure that it only generates
| syntactically valid SQL, including awareness of the valid set of
| field names and their types, and such.
|
| It would not be easy, by any means, but I believe it is
| theoretically possible.
| kasmura wrote:
| That cannot be done when using OpenAI API calls as far as I
| know
| coder543 wrote:
| Nobody in the original post or this entire discussion said
| anything about OpenAI until your comment.
|
| I thought it was fairly obvious that we were talking about a
| local LLM agent... if DataHerald is a wrapper around only
| OpenAI, and no other options, then that seems unfortunate.
| aazo11 wrote:
| The agent is LLM agnostic and you can use it with OpenAI or
| self-hosted LLMs. For self hosted LLM we have benchmarked
| performance with Mixtral for tool selection and CodeLlama
| for code generation.
| aazo11 wrote:
| The agent currently executed the generated SQL (limited to 10
| rows) and recovers from errors.
| vlovich123 wrote:
| Are there strategic parts of the stack you haven't open-sourced?
| aazo11 wrote:
| The entirety of the codebase is now open source.
| vlovich123 wrote:
| It's sometimes hard to understand how you keep a business
| running when you've open-sourced your entire stack, both
| consumers self-hosting but even worse would be a competitor
| just taking what you spend R&D budget on & rehosting it with
| a cheaper price since they don't need to pay for that R&D.
| From a business perspective, do you see the operational
| challenge of running your stack at scale as the
| differentiator?
| bobismyuncle wrote:
| Curious why you decided to open source your entire product. Are
| you moving to an open core model? I'd expect in that case that
| much of 2, 3 & 4 would have stayed closed. Would be grateful if
| you can share your reasoning
| robertlagrant wrote:
| This is often the move when the team's spent the money
| developing something and now the end's in sight, so they want
| the chance to leave and take the code with them.
|
| Don't know if this is that at all, but it's always worth
| considering.
| akch wrote:
| Not finding the license anywhere. Which one have you chosen?
| aazo11 wrote:
| Hi -- the license is Apache 2.0
| numlocked wrote:
| Is that documented somewhere? The "contributing" link in the
| readme also 404s. I would definitely need to understand the
| licensing and how contributing works before I could consider
| integrating this (and I would definitely consider it!). Cool
| stuff.
| saigal wrote:
| We'll make the licensing more visible. Stay tuned in a few
| minutes.
| numlocked wrote:
| thanks much!
| threesevenths wrote:
| Guess it's public domain since there is no license
| munk-a wrote:
| Just for future reference - if no license is given it's
| unlicensed. Licensing defaults closed for extremely good
| reasons - that's one of the reasons why github had a strong
| push for users to assign appropriate licensing documents to
| repositories a while back and declare those licenses in
| machine readable forms (if applicable).
| saigal wrote:
| Apache 2.0.
| alchemist1e9 wrote:
| CoT, OPA, CoALA ... these techniques can deliver massive
| performance improvements. Are there any other methods for agent
| frameworks?
|
| any way to follow these developments vs pure LLM research?
| fpater wrote:
| Super cool to see this!! I've been prototyping with NL-to-SQL
| recently, one problem I've stumble into is how to prevent
| mistakes from impacting your database, be it a hallucination or
| even a malicious actor who was able to send a prompt to the LLM
| agent. I don't have much input about the questions you asked
| here, but feel free to contact me (info on my profile) if you'd
| like to talk about those other aspects!!
| aazo11 wrote:
| Sure will reach you out. Currently Dataherald blocks DML or DDL
| commands from being generated/executed.
| arrosenberg wrote:
| I still wonder who the audience is for tools like this. The
| website posits you can answer data questions without going
| through an analyst, but the role of the analyst is not to be a
| SQL whisperer for PMs and Executives - it is to be an expert in
| the model and the data. A data warehouse of any real scale is
| going to have some amount of issues - anomalous data, different
| interpretations of the same numbers - how does the LLM deal with
| that consistently across a business?
| saigal wrote:
| the target audience is developers who wish to embed text to SQL
| functionality into their own products. the target audience is
| less the 'internal use case' (i.e. a data analyst) and more
| about letting external users do things they couldn't do before.
| a good example is payroll software where this type of
| technology can allow users to pull reports.
| arrosenberg wrote:
| I agree that is a more reasonable use-case. The readme for
| this tool seems geared toward the business of answering
| business questions.
| saigal wrote:
| Tbh the original intention was to be the "data analyst" but
| we found over time (and with literally 100s of user
| conversations at small cos and enterprises) the embedded
| use case was more interesting and made for a better
| business, which was not at all what we expected.
| greenavocado wrote:
| > the target audience is developers who wish to embed text to
| SQL functionality into their own products
|
| Who is asking?
| boredemployee wrote:
| you wouldnt believe the amount of developers that don't
| know how to write sql
| saigal wrote:
| hmm not sure I understand the question
| twojacobtwo wrote:
| I think GP meant 'where or from whom have you seen/heard
| demand for this?'.
|
| Weirdly, I was just thinking about using an LLM to form
| sql queries for me, because I've forgotten much of what I
| knew. First time I had that thought and 5 minutes later,
| this fascinating idea rolls into my feed to pull me in
| further. I know I'm not exactly the target audience, but
| now I'm intrigued.
|
| I went through a coding/design bootcamp a while back and
| there was virtually no focus on SQL, so a lot of my
| classmates were hesitant to jump into relational dbs for
| projects. I could see it being used in a tool for new
| devs or those who've focused on a JS stack and need some
| help with SQL.
| saigal wrote:
| We've seen demand from all types of SaaS applications
| where the user might need data-- software that helps
| customer support staff answer data questions, CRM,
| payroll software, just to name a few.
| skydhash wrote:
| > _I could see it being used in a tool for new devs or
| those who 've focused on a JS stack and need some help
| with SQL_
|
| Or they could buy a book like _Learning SQL_. Or spend a
| weekend on Youtube.
| saigal wrote:
| allow me to clarify.. Dataherald isn't intended for
| developers because they don't know SQL, it's intended for
| developers who want to build text to SQL into their
| products
| TheRealPomax wrote:
| With what level of accuracy? And what guarantee of
| correctness? Because a report that happens to get the joins
| wrong once every 1000 reports is going to lead to fun legal
| problems.
|
| You still need someone who understands why you should use
| which approach to get the data you need without getting
| completely wrong numbers back that _look_ perfectly fine but
| reflect fantasy, not reality.
| saigal wrote:
| i agree that there will be "early adopter" type use cases
| and others that might take a while (e.g. healthcare with
| hipaa compliance)
|
| it is still the early days. goal is to give the developer
| tools to do this easier.
| chx wrote:
| Enough of this weasel talk.
|
| It's not the early days.
|
| Not by a country mile.
|
| To quote Cory Doctorow
|
| > I don't see any path from continuous improvements to
| the (admittedly impressive) "machine learning" field that
| leads to a general AI any more than I can see a path from
| continuous improvements in horse-breeding that leads to
| an internal combustion engine.
|
| You can counter it doesn't necessarily need an AGI here
| but that doesn't change the fact you can't crank this
| engine harder and expect it to power an airplane.
|
| And, as always
| https://hachyderm.io/@inthehands/112006855076082650
|
| > You might be surprised to learn that I actually think
| LLMs have the potential to be not only fun but genuinely
| useful. "Show me some bullshit that would be typical in
| this context" can be a genuinely helpful question to have
| answered, in code and in natural language -- for
| brainstorming, for seeing common conventions in an
| unfamiliar context, for having something crappy to react
| to.
|
| > Alas, that does not remotely resemble how people are
| pitching this technology.
| _hzw wrote:
| x
| thedynamicduo wrote:
| This looks really cool, can't wait to check it out. The problem
| I've seen with other tools I've tinkered with is that they do
| well with simple stuff like:
|
| "what are my latest orders" -> select * from orders where
| user_id=x order by created_date
|
| But really struggle when you have a complex schema that requires
| joins, and basically has no support when you are describing
| something that needs outer joins or the like. Would be great to
| hear if DataHerald has cracked that nut or if it's still a
| challenge for you as well (no judgement if it is, it seems like a
| hard problem).
| saigal wrote:
| great question, and the one that we get the most :-) this is
| precisely why we created Dataherald. Off the shelf LLMs can
| handle a single table and simple questions. Dataherald's quest
| is to ultimately provide enterprise-grade text to SQL, where
| complex schema and joins are present. it does take some
| training, but we've found that it can handle situations such as
| the one you mention above.
| mholubowski wrote:
| Hey! Why did you open source it? Genuinely curious.
| tootie wrote:
| Is the only LLM support OpenAI?
| saigal wrote:
| "The agent is LLM agnostic and you can use it with OpenAI or
| self-hosted LLMs."
| totalhack wrote:
| Is this more like text-to-semantic layer or does it throw the
| schema in the prompt and generate SQL with the llm?
| aazo11 wrote:
| This is not a text to semantic layer but it does far more than
| just inject schema into the prompt:
|
| - the engine keeps an updated catalog of the data (low
| cardinality columns, their values etc) - taps into query
| history and finetunes the model to the schema - allows
| uploading context from unstructured sources like docs and data
| dictionaries - has an agent which collects all relevant info,
| generate the SQL, tries to retrieve a few rows to recover from
| errors and provides an confidence score to the generated SQL
| winphone1974 wrote:
| SQL is really close to a natural language that's unambiguous,
| there's a few rough edges but it's not bad. Anything more natural
| requires a lot of context and needs to solve ambiguity.
| saigal wrote:
| while i agree, there is clear demand for people to use natural
| language to SQL. we have tremendous conviction around the
| desire for natural language tools, but of course the technology
| and product need to deliver desired results.
| saigal wrote:
| "Anything more natural requires a lot of context and needs to
| solve ambiguity.
|
| this is precisely why we created Dataherald-- to make it much
| easier to add that business context so that NL to SQL could
| actually be good enough to get into production
| kwerk wrote:
| Will it work with GraphQL?
| aazo11 wrote:
| Currently does not but looking to add support. Would love to
| connect and learn more about your use case.
| throwaway115 wrote:
| What guarantees do you offer with query security if I turn this
| over to an end user? How do I keep them only accessing their own
| data?
| freeone3000 wrote:
| Any number of database namespacing techniques already present
| in postgresql can prevent this. Link the user sign-on to a DB
| user and you're gold.
___________________________________________________________________
(page generated 2024-05-24 23:00 UTC)