[HN Gopher] Show HN: We open sourced our entire text-to-SQL product
       ___________________________________________________________________
        
       Show HN: We open sourced our entire text-to-SQL product
        
       Long story short: We (Dataherald) just open-sourced our entire
       codebase, including the core engine, the clients that interact with
       it and the backend application layer for authentication and RBAC.
       You can now use the full solution to build text-to-SQL into your
       product.

       The Problem: modern LLMs write syntactically correct SQL, but they
       struggle with real-world relational data. This is because
       real-world data and schemas are messy, natural language is often
       ambiguous, and LLMs are not trained on your specific dataset.

       Solution: The core NL-to-SQL engine in Dataherald is an LLM-based
       agent which uses Chain of Thought (CoT) reasoning and a number of
       different tools to generate high-accuracy SQL from a given user
       prompt. The engine achieves this by:

       - Collecting context at configuration time from the database and
         sources such as data dictionaries and unstructured documents,
         which are stored in a data store or a vector DB and injected if
         relevant
       - Allowing users to upload sample NL <> SQL pairs (golden SQL),
         which can be used in few-shot prompting or to fine-tune an
         NL-to-SQL LLM for that specific dataset
       - Executing the SQL against the DB to get a few sample rows and
         recover from errors
       - Using an evaluator to assign a confidence score to the generated
         SQL

       The repo includes four services
       (https://github.com/Dataherald/dataherald/tree/main/services):

       1- Engine: The core service which includes the LLM agent, vector
          stores and DB connectors.
       2- Admin Console: a NextJS front-end for configuring the engine and
          observability.
       3- Enterprise Backend: Wraps the core engine, adding
          authentication, caching, and APIs for the frontend.
       4- Slackbot: Integrate Dataherald directly into your Slack workflow
          for on-the-fly data exploration.

       Would love to hear from the community on building natural language
       interfaces to relational data. Anyone live in production without a
       human in the loop? Thoughts on how to improve performance without
       spending weeks on model training?
        
       Author : aazo11
       Score  : 433 points
       Date   : 2024-05-23 15:50 UTC (2 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | instabart wrote:
       | Interesting! I am assuming it can do complex joins. Are there any
       | examples of text -> sql it produces? I looked on the website but
       | only saw "coming soon"
        
         | aazo11 wrote:
         | Yes. There is a schema linking step which identifies relevant
         | columns and tables including any foreign key relationships if
         | they exist.
         | 
         | The agent also can be finetuned on sample NL <> SQL pairs or
         | they can be used in few shot prompting.
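
The few-shot use of golden NL <> SQL pairs mentioned above can be
sketched roughly as follows. This is a minimal illustration, not
Dataherald's implementation: `pick_examples` and `build_prompt` are
hypothetical helpers, and plain word overlap stands in for the
similarity search a vector store would more likely provide.

```python
import re
from typing import List, Set, Tuple

# A golden pair is (natural-language question, known-good SQL).
GoldenPair = Tuple[str, str]


def _words(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens of a question."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_examples(question: str, golden: List[GoldenPair], k: int = 2) -> List[GoldenPair]:
    """Return the k golden pairs whose questions overlap most with `question`.

    Jaccard word overlap is an assumption made for this sketch; an
    embedding-based lookup would be the more typical choice.
    """
    q = _words(question)

    def score(pair: GoldenPair) -> float:
        w = _words(pair[0])
        union = q | w
        return len(q & w) / len(union) if union else 0.0

    return sorted(golden, key=score, reverse=True)[:k]


def build_prompt(question: str, schema: str, golden: List[GoldenPair], k: int = 2) -> str:
    """Assemble a few-shot prompt: schema, k worked examples, then the new question."""
    parts = ["Schema:", schema, ""]
    for nl, sql in pick_examples(question, golden, k):
        parts += [f"Question: {nl}", f"SQL: {sql}", ""]
    parts += [f"Question: {question}", "SQL:"]
    return "\n".join(parts)
```

The prompt ends with an open `SQL:` line so that the model completes
the query for the new question after seeing the closest examples.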
        
       | coder543 wrote:
       | Have you considered enforcing a grammar on the LLM when it is
       | generating SQL? This could ensure that it only generates
       | syntactically valid SQL, including awareness of the valid set of
       | field names and their types, and such.
       | 
       | It would not be easy, by any means, but I believe it is
       | theoretically possible.
        
         | kasmura wrote:
         | That cannot be done when using OpenAI API calls as far as I
         | know
        
           | coder543 wrote:
           | Nobody in the original post or this entire discussion said
           | anything about OpenAI until your comment.
           | 
           | I thought it was fairly obvious that we were talking about a
           | local LLM agent... if DataHerald is a wrapper around only
           | OpenAI, and no other options, then that seems unfortunate.
        
             | aazo11 wrote:
             | The agent is LLM agnostic and you can use it with OpenAI or
             | self-hosted LLMs. For self-hosted LLMs we have benchmarked
             | performance with Mixtral for tool selection and CodeLlama
             | for code generation.
        
         | aazo11 wrote:
         | The agent currently executes the generated SQL (limited to 10
         | rows) and recovers from errors.
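
A rough sketch of the execute-and-recover loop described above,
assuming SQLite and a hypothetical `repair` callback standing in for
the LLM call that rewrites a failing query. The function name and the
retry count are illustrative and not taken from the Dataherald
codebase; only the 10-row cap comes from the comment.

```python
import sqlite3
from typing import Callable, List, Tuple


def run_with_recovery(
    conn: sqlite3.Connection,
    sql: str,
    repair: Callable[[str, str], str],
    max_attempts: int = 3,
    row_limit: int = 10,
) -> Tuple[str, List[tuple]]:
    """Execute `sql` and return (final_sql, sample_rows).

    On a database error, the failing query and the error message are
    passed to `repair` (hypothetical: in practice an LLM call), and the
    returned query is tried next. The last error is re-raised once
    `max_attempts` is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            cursor = conn.execute(sql)
            # Only a small sample is fetched, enough to sanity-check the query.
            return sql, cursor.fetchmany(row_limit)
        except sqlite3.Error as exc:
            if attempt == max_attempts - 1:
                raise
            sql = repair(sql, str(exc))
    raise RuntimeError("unreachable")
```

Feeding the database's own error text back to the model is what lets
the loop correct things like a misspelled column name without a human
in between.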
        
         | Kiro wrote:
         | That sounds overkill. It's usually enough to just tell the LLM
         | to output valid SQL and it will adhere to the schema.
        
           | lmeyerov wrote:
           | In our experience building louie.ai for a continuous learning
           | variant of text2query (and for popular DBs beyond SQL),
           | getting syntax right via a symbolic lint phase is a nice
           | speedup, but not a correctness issue. For syntax, bigger
           | LLMs are generally right on the first shot, and an agent loop
           | autocorrects quickly when the DB gives a syntax error.
           | 
           | Much more time for us goes to things like:
           | 
           | * Getting the right table, column name spelling
           | 
           | * Disambiguating typos when users define names, and deciding
           | whether they mean a specific name or are using a shorthand
           | 
           | * Disambiguating selection when there are multiple for the
           | same thing: hint - this needs to be learned from usage, not
           | by static schema analysis
           | 
           | * Guard rails, such as on perf
           | 
           | * Translation from non-technical user concepts to analyst
           | concepts
           | 
           | * Enterprise DB schemas are generally large and often blow
           | out the LLM context window, or make things slow, expensive,
           | and lossy if you rely on giant context windows
           | 
           | * Learning and team modes so the model improves over time.
           | User teaching interfaces are especially tricky once you
           | expose them - learning fuzzy vs explicit modes, avoiding data
           | leakage, ...
           | 
           | * A lot of power comes from being part of an agentic loop
           | with other tools like Python and charting, which creates a
           | 'composition' problem that requires AI optimization across
           | any sub-AIs
           | 
           | We have been considering open-sourcing this layer of
           | louie.ai, but it hasn't been a priority for our customers,
           | who are the analyst orgs using our UIs on top (Splunk,
           | OpenSearch, Neo4j, Databricks, ...), and occasionally
           | building their own internal tools on top of our API. Our
           | focus has been building a sustainable and high quality
           | project, and these OSS projects seem to be very different to
           | sustain without also solving that, which is hard enough
           | as-is.
        
       | vlovich123 wrote:
       | Are there strategic parts of the stack you haven't open-sourced?
        
         | aazo11 wrote:
         | The entirety of the codebase is now open source.
        
           | vlovich123 wrote:
           | It's sometimes hard to understand how you keep a business
           | running when you've open-sourced your entire stack, both
           | consumers self-hosting but even worse would be a competitor
           | just taking what you spend R&D budget on & rehosting it with
           | a cheaper price since they don't need to pay for that R&D.
           | From a business perspective, do you see the operational
           | challenge of running your stack at scale as the
           | differentiator?
        
             | fhd2 wrote:
             | Not affiliated with OP and therefore unable to answer your
             | question, but there's a lot of products that built traction
             | that way: WordPress, GitLab, Discourse, Docker, Ubuntu, ...
             | 
             | I think it solves the problem of gaining traction today, at
             | the expense of future market power. Then they face a choice
             | of pulling a HashiCorp or being OK with being a commodity
             | provider rather than a fancy unicorn.
             | 
             | I can see the appeal, a humble business is better than no
             | business, isn't it?
        
       | bobismyuncle wrote:
       | Curious why you decided to open source your entire product. Are
       | you moving to an open core model? I'd expect in that case that
       | much of 2, 3 & 4 would have stayed closed. Would be grateful if
       | you can share your reasoning
        
         | robertlagrant wrote:
         | This is often the move when the team's spent the money
         | developing something and now the end's in sight, so they want
         | the chance to leave and take the code with them.
         | 
         | Don't know if this is that at all, but it's always worth
         | considering.
        
           | e1g wrote:
           | That is almost certainly what's happening here. They raised
           | $3M three years ago, at the peak of valuations, and don't
           | have the metrics to raise a Series A in the current climate.
           | Running out of money and want to leave some artifact behind.
           | A very difficult and emotional transition.
        
           | hackernewds wrote:
           | I don't understand the "leave the code with them" part
        
             | learnedbytes wrote:
             | I think they mean by open sourcing, they can take the code
             | to a new startup without having IP legality issues.
        
               | tomhallett wrote:
               | Would open sourcing the core IP of a company "typically"
               | require board approval?
               | 
               | If a company goes under, the investors will want to sell
               | off the IP, open sourcing everything would make that IP
               | less valuable. There must be some blanket clause in the
               | term sheet to cover that, right? Ie: founders won't do
               | anything which will materially hurt the company without
               | board approval (or something, I am nowhere close to a
               | lawyer, this is all conjecture)
        
               | throwup238 wrote:
               | If it mattered it would have become part of VC contracts
               | years ago.
               | 
               | Early stage VCs make money on the big winners, not on the
               | tail end of companies that don't exit for 100x. For the
               | most part, except for patents the IP is worth less than
               | the Aeron chairs at the end.
        
               | re-thc wrote:
               | > Would open sourcing the core IP of a company
               | "typically" require board approval?
               | 
               | At this stage, if they can't raise a series A I'd assume
               | they still have a majority of the "board" to themselves.
        
             | cess11 wrote:
             | You can look at the history of Erlang for a similar
             | example. A very crude summary could be that Ericsson
             | developed the language, took it to production, and then
             | they decided to replace it with Java. The people that did
             | the language design convinced management to release Erlang
             | and the VM as FOSS, and then they promptly went and started
             | a company that could use the tooling they'd developed.
             | 
             | I'm aware I'm leaving out a lot of detail, but it's not
             | clear to me what has become public knowledge and what has
             | not and I happen to know some people that were involved.
        
         | krawczstef wrote:
         | getting into enterprise is hard, so probably trying open source
         | to help with that.
        
           | saigal wrote:
           | Enterprises are spending lots of time and money on this. The
           | biggest issue that has slowed down the sales cycle at this stage
           | has been data governance. Most folks think it's about
           | accuracy or latency (which of course is an issue) but data
           | governance can make this whole thing a non starter.
        
             | edmundsauto wrote:
             | Can you explain more about why governance is the issue with
             | a service like this? Companies not wanting their data to go
             | off prem?
        
               | saigal wrote:
               | yes. some want BYOC solutions. others don't want to even
               | be perceived as being used to train an LLM. not to
               | mention CCPA, GDPR, etc etc etc.
               | 
               | lots of questions around what data is being sent to the
               | LLM, or just schema.
        
               | edmundsauto wrote:
               | Interesting. So by open sourcing you think companies can
               | self host and it negates some of these issues? Or is your
               | goal to increase future contributions to keep the
               | project alive and developing?
               | 
               | What % of the NL -> SQL problem is solved in the current
               | version? Ie is this something ready for some type of prod
               | work now, or is it "in 2-3 years we'll be there"?
        
               | threecheese wrote:
               | Not OP, but there was an EHR SaaS company on HN a day or
               | two back with a similar proposition: it's open source, so
               | it can be independently verified from a security
               | perspective. It was interesting to me because the code
               | was unusable to normal folks, and even other companies -
               | one of the founders described their moat being the
               | trouble of actually integrating with the ecosystem, and
               | weren't worried about competitors using it. It really
               | hammered home to me how open source is more and more a
               | marketing lever lately.
        
               | aazo11 wrote:
               | There are organizations using Dataherald in production
               | right now.
               | 
               | The latency is ~20-30s and it takes some set up, so as
               | long as those are not blockers it can be used in prod.
        
               | saigal wrote:
               | For companies that are willing to put in some effort, the
               | self hosting option is a great one. There are certain use
               | cases where this works now, and is already in production.
               | These tend to be use cases with some constraints and
               | don't deal with very sensitive data.
        
       | akch wrote:
       | Not finding the license anywhere. Which one have you chosen?
        
         | aazo11 wrote:
         | Hi -- the license is Apache 2.0
        
           | numlocked wrote:
           | Is that documented somewhere? The "contributing" link in the
           | readme also 404s. I would definitely need to understand the
           | licensing and how contributing works before I could consider
           | integrating this (and I would definitely consider it!). Cool
           | stuff.
        
             | saigal wrote:
             | We'll make the licensing more visible. Stay tuned in a few
             | minutes.
        
               | numlocked wrote:
               | thanks much!
        
               | aazo11 wrote:
               | Added the License
        
         | threesevenths wrote:
         | Guess it's public domain since there is no license
        
           | munk-a wrote:
           | Just for future reference - if no license is given it's
           | unlicensed. Licensing defaults closed for extremely good
           | reasons - that's one of the reasons why github had a strong
           | push for users to assign appropriate licensing documents to
           | repositories a while back and declare those licenses in
           | machine readable forms (if applicable).
        
             | saigal wrote:
             | Apache 2.0.
        
       | alchemist1e9 wrote:
       | CoT, OPA, CoALA ... these techniques can deliver massive
       | performance improvements. Are there any other methods for agent
       | frameworks?
       | 
       | any way to follow these developments vs pure LLM research?
        
       | fpater wrote:
       | Super cool to see this!! I've been prototyping with NL-to-SQL
       | recently, one problem I've stumbled into is how to prevent
       | mistakes from impacting your database, be it a hallucination or
       | even a malicious actor who was able to send a prompt to the LLM
       | agent. I don't have much input about the questions you asked
       | here, but feel free to contact me (info on my profile) if you'd
       | like to talk about those other aspects!!
        
         | aazo11 wrote:
         | Sure, will reach out. Currently Dataherald blocks DML or DDL
         | commands from being generated/executed.
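
One way such a DML/DDL block could look is a conservative lexical
check run before execution. This is a sketch under assumptions, not
Dataherald's actual guard: `is_read_only` and the `FORBIDDEN` keyword
list are hypothetical, and the check deliberately rejects some
harmless queries rather than risk letting a write through. Connecting
with a database role that only has read privileges remains the
stronger defence.

```python
import re

# Keywords that indicate a write or schema change (illustrative, not exhaustive).
FORBIDDEN = {
    "insert", "update", "delete", "merge", "replace", "create",
    "alter", "drop", "truncate", "grant", "revoke",
}

# String literals, quoted identifiers and comments, matched in one pass so
# that a quote inside a comment (or a "--" inside a string) cannot hide code.
_LITERAL_OR_COMMENT = re.compile(
    r"'(?:[^']|'')*'"      # single-quoted string, '' is an escaped quote
    r'|"[^"]*"'            # double-quoted identifier
    r"|--[^\n]*"           # line comment
    r"|/\*.*?\*/",         # block comment
    re.S,
)


def is_read_only(sql: str) -> bool:
    """Return True only for a single SELECT/WITH statement with no write keywords."""
    # Blank out literals and comments so their contents are not scanned.
    text = _LITERAL_OR_COMMENT.sub(" ", sql)
    statements = [s for s in text.split(";") if s.strip()]
    if len(statements) != 1:
        return False
    tokens = re.findall(r"[a-z_][a-z0-9_]*", statements[0].lower())
    if not tokens or tokens[0] not in {"select", "with"}:
        return False
    return not any(token in FORBIDDEN for token in tokens)
```

Scanning every token, not just the first, matters because some
databases allow data-modifying statements inside a `WITH` clause.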
        
       | arrosenberg wrote:
       | I still wonder who the audience is for tools like this. The
       | website posits you can answer data questions without going
       | through an analyst, but the role of the analyst is not to be a
       | SQL whisperer for PMs and Executives - it is to be an expert in
       | the model and the data. A data warehouse of any real scale is
       | going to have some amount of issues - anomalous data, different
       | interpretations of the same numbers - how does the LLM deal with
       | that consistently across a business?
        
         | saigal wrote:
         | the target audience is developers who wish to embed text to SQL
         | functionality into their own products. the target audience is
         | less the 'internal use case' (i.e. a data analyst) and more
         | about letting external users do things they couldn't do before.
         | a good example is payroll software where this type of
         | technology can allow users to pull reports.
        
           | arrosenberg wrote:
           | I agree that is a more reasonable use-case. The readme for
           | this tool seems geared toward the business of answering
           | business questions.
        
             | saigal wrote:
             | Tbh the original intention was to be the "data analyst" but
             | we found over time (and with literally 100s of user
             | conversations at small cos and enterprises) the embedded
             | use case was more interesting and made for a better
             | business, which was not at all what we expected.
        
               | rkuodys wrote:
               | Could you share how products integrate txt to sql within
               | a product? Very curious
        
               | saigal wrote:
               | Search bar within the SaaS interface that allows user to
               | ask data questions and returns back NL answer or specific
               | cut of data
        
           | greenavocado wrote:
           | > the target audience is developers who wish to embed text to
           | SQL functionality into their own products
           | 
           | Who is asking?
        
             | boredemployee wrote:
             | you wouldn't believe the amount of developers that don't
             | know how to write sql
        
               | threeseed wrote:
               | Because ORM libraries were invented 30 years ago.
               | 
               | There is no requirement to learn SQL for most of the
               | applications built today.
        
               | sfn42 wrote:
               | ORM doesn't really excuse you from understanding what's
               | going on. In a way using ORM is more difficult because
               | you have to understand both what sql you want and how to
               | get the framework to generate it for you.
               | 
               | Of course there's a lot of incompetent people who have no
               | idea what they're doing, if it seems to work they ship
               | it. That leads to a lot of nonsensical bullshit and
               | unnecessarily slow systems.
        
               | lelanthran wrote:
               | > Because ORM libraries were invented 30 years ago.
               | 
               | > There is no requirement to learn SQL for most of the
               | applications built today.
               | 
               | In the same way that because Linked List libraries were
               | invented 50 years ago, there is no requirement to learn
               | what linked lists are for most of the applications built
               | today?
               | 
               | You aren't getting past the requirement to learn
               | relational databases "because ORM", and there is no
               | material or course that teaches relational databases
               | without teaching SQL.
               | 
               | The unfortunate result of this is that people who boast
               | about knowing $ORM while not knowing SQL have never
               | learned relational databases either.
        
               | meekaaku wrote:
               | Most applications don't need to get data from a relational
               | database. But for those apps that do, knowing SQL is
               | pretty much a must have. The developer himself or someone
               | on the team.
        
               | cerved wrote:
               | Depends how good you want the application to be
        
             | saigal wrote:
             | hmm not sure I understand the question
        
               | twojacobtwo wrote:
               | I think GP meant 'where or from whom have you seen/heard
               | demand for this?'.
               | 
               | Weirdly, I was just thinking about using an LLM to form
               | sql queries for me, because I've forgotten much of what I
               | knew. First time I had that thought and 5 minutes later,
               | this fascinating idea rolls into my feed to pull me in
               | further. I know I'm not exactly the target audience, but
               | now I'm intrigued.
               | 
               | I went through a coding/design bootcamp a while back and
               | there was virtually no focus on SQL, so a lot of my
               | classmates were hesitant to jump into relational dbs for
               | projects. I could see it being used in a tool for new
               | devs or those who've focused on a JS stack and need some
               | help with SQL.
        
               | saigal wrote:
               | We've seen demand from all types of SaaS applications
               | where the user might need data-- software that helps
               | customer support staff answer data questions, CRM,
               | payroll software, just to name a few.
        
               | skydhash wrote:
               | > _I could see it being used in a tool for new devs or
               | those who've focused on a JS stack and need some help
               | with SQL_
               | 
               | Or they could buy a book like _Learning SQL_. Or spend a
               | weekend on Youtube.
        
               | saigal wrote:
               | allow me to clarify.. Dataherald isn't intended for
               | developers because they don't know SQL, it's intended for
               | developers who want to build text to SQL into their
               | products
        
               | nicoburns wrote:
               | But who wants text-to-sql in products that they use? You
               | wouldn't be able to trust the results. So what is it
               | useful for? Of course you could learn to check the
               | output. But then you could just learn SQL. I know dozens
               | of not particularly technical people (certainly not
               | software developers) who have learnt enough SQL to be
               | useful over a couple of days.
        
               | aazo11 wrote:
               | While the engine response is not accurate all the time,
               | the engine returns a confidence score. We have never
               | encountered cases where a deployment with necessary
               | training data indicates a .9 confidence score on an
               | incorrectly generated SQL.
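
A confidence score like this is typically consumed as a gate in the
embedding application. The sketch below assumes a score in [0, 1] and
uses 0.9 as the cutoff only because that figure appears in the
comment; `GeneratedQuery` and `route` are hypothetical names, not part
of the Dataherald API.

```python
from dataclasses import dataclass


@dataclass
class GeneratedQuery:
    sql: str
    confidence: float  # evaluator score, assumed to lie in [0, 1]


def route(result: GeneratedQuery, threshold: float = 0.9) -> str:
    """Decide whether a generated query may run unattended.

    Returns "auto_run" when the score meets the threshold and
    "needs_review" otherwise, so low-confidence SQL reaches a human.
    """
    if not 0.0 <= result.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return "auto_run" if result.confidence >= threshold else "needs_review"
```

Where to set the threshold is a per-deployment decision that trades
the share of questions answered automatically against the risk of a
wrong answer slipping through.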
        
               | Kiro wrote:
               | The demand is huge. Accuracy is less important because
               | the alternative is being completely in the dark or waiting
               | for a developer to get the data for you. In my experience
               | people want to quickly get a ballpark number before they
               | dig deeper.
               | 
               | I agree that you should just learn SQL but that doesn't
               | change the fact that a lot of companies want this right
               | now. SQLAI claims to have hundreds of thousands of
               | customers.
        
               | saigal wrote:
               | Couldn't agree with this more. Exactly.
        
               | edmundsauto wrote:
               | I think a lot of people want something like this.
               | Especially as more non technical people are adding
               | business analysts to their jd.
               | 
               | I've tried to teach SQL to PMs, bug triage specialists,
               | etc. Even a couple of days is too much time for them to
               | learn something not critical or core to their job. Their
               | alternative is to bug data teams with adhoc requests,
               | which data people hate.
               | 
               | A tool like this would probably save 15% of a data team's
               | time, and reduce the worst part of their job. At
               | companies with hundreds, or even thousands, of data folks
               | - that's massive
               | 
               | And the users are smart people. They can read SQL to see
               | if it looks like the right filters are applied. The
               | "accuracy" issue exists but for certain use cases, it's
               | honestly not the biggest concern.
               | 
               | Not sure why the tone in this thread is so negative. To
               | the founders, thank you!
        
               | saigal wrote:
               | we've encountered a lot of instances when people know SQL
               | but just want a first draft of SQL to expedite the
               | process. we see this a lot from data analysts too.
        
           | TheRealPomax wrote:
           | With what level of accuracy? And what guarantee of
           | correctness? Because a report that happens to get the joins
           | wrong once every 1000 reports is going to lead to fun legal
           | problems.
           | 
           | You still need someone who understands why you should use
           | which approach to get the data you need without getting
           | completely wrong numbers back that _look_ perfectly fine but
           | reflect fantasy, not reality.
        
             | saigal wrote:
             | i agree that there will be "early adopter" type use cases
             | and others that might take a while (e.g. healthcare with
             | hipaa compliance)
             | 
             | it is still the early days. goal is to give the developer
             | tools to do this easier.
        
               | chx wrote:
               | Enough of this weasel talk.
               | 
               | It's not the early days.
               | 
               | Not by a country mile.
               | 
               | To quote Cory Doctorow
               | 
               | > I don't see any path from continuous improvements to
               | the (admittedly impressive) "machine learning" field that
               | leads to a general AI any more than I can see a path from
               | continuous improvements in horse-breeding that leads to
               | an internal combustion engine.
               | 
               | You can counter it doesn't necessarily need an AGI here
               | but that doesn't change the fact you can't crank this
               | engine harder and expect it to power an airplane.
               | 
               | And, as always
               | https://hachyderm.io/@inthehands/112006855076082650
               | 
               | > You might be surprised to learn that I actually think
               | LLMs have the potential to be not only fun but genuinely
               | useful. "Show me some bullshit that would be typical in
               | this context" can be a genuinely helpful question to have
               | answered, in code and in natural language -- for
               | brainstorming, for seeing common conventions in an
               | unfamiliar context, for having something crappy to react
               | to.
               | 
               | > Alas, that does not remotely resemble how people are
               | pitching this technology.
        
               | warkdarrior wrote:
               | Indeed, AI is not marketed as a BS generator, just as
               | HTTP is not marketed as a spam/ad/fraud/harassment
               | transport protocol. All technologies are dual-use, deal
               | with it!
        
               | EasyMark wrote:
               | There's the old adage of "trust, but verify". With LLMs
               | I'm feeling it more like "Acknowledge, but verify, and
               | verify again". It has certainly pointed me in the right
               | direction faster vs google "here's some SEO stuff to sort
               | through" :)
        
               | saigal wrote:
               | I agree with you. The larger point with text to SQL,
               | however, is that it will not work if it is a simple wrap
               | of an LLM (GPT or otherwise). Text to SQL will only work
               | if there is a sufficient understanding of the business
               | context required. To do this is hard, but with tools such
               | as Dataherald a dev's life gets a whole lot easier.
        
               | chx wrote:
               | what is your affiliation with Dataherald
        
               | cess11 wrote:
               | Likely co-founder and CEO:
               | https://www.dataherald.com/company
        
               | saigal wrote:
               | Yes. Correct.
        
               | Terr_ wrote:
               | > can't crank this engine harder and expect it to power
               | an airplane.
               | 
               | Similarly, but from my far-less notable-self in another
               | discussion today:
               | 
               | > [H]uman exuberance is riding on the (questionable) idea
               | that a really good text-correlation specialist can
               | effectively impersonate a general AI.
               | 
               | > Even worse: Some people assume an exceptional text-
               | specialist model will effectively meta-impersonate a
               | generalist model impersonating a different kind of
               | specialist!
        
               | altdataseller wrote:
               | It's not the early days in terms of expecting digital
               | tools to be correct 99% of the time. Early adoption age
               | was back in 2000-2009. Now everyone expects polished
               | tools that does what it expects them to do
        
               | saigal wrote:
               | "...what it expects them to do"
               | 
               | therein lies the nuance. some people expect to get a
               | natural language answer back. others expect to get a data
               | table back. others expect to get correct SQL back. this
               | is why it's so important to understand the use case and
               | not bucket everything together.
        
               | saigal wrote:
               | if you expect correct 99% of the time, you will be
               | waiting for a very very very long time for most, except
               | for the most constrained, use cases
        
             | ecjhdnc2025 wrote:
             | > With what level of accuracy? And what guarantee of
             | correctness? Because a report that happens to get the joins
             | wrong once every 1000 reports is going to lead to fun legal
             | problems.
             | 
             | The truth is everyone knows LLMs can't tell correct from
             | error, can't tell real from imagined, and cannot care.
             | 
             | The word "hallucinate" has been used to explain when an LLM
             | gets things wrong, when it's equally applicable to when it
             | _gets things right_.
             | 
             | Everyone thinks the hallucinations can be trained out,
             | leaving only edge cases. But in reality, edge cases are
             | often horror stories. And an LLM edge case isn't a known
             | quantity for which, say, limits, tolerances and test suites
             | can really do the job. Because there's nobody with domain
             | skill saying, look, this is safe or viable within these
             | limits.
             | 
             | All LLM products are built with the same intention: we can
             | use this to replace real people or expertise that is
             | expensive to develop, or sell it to companies on that
             | basis.
             | 
             | If it goes wrong, they know the excited customer will
             | invest an unbillable amount of time re-training the LLM or
             | double-checking its output -- developing a new unnecessary,
             | tangential skill or still spending time doing what the LLM
             | was meant to replace.
             | 
             | But hopefully you only need a handful of such babysitters,
             | right? And if it goes really wrong there are disclaimers
             | and legal departments.
        
             | panarky wrote:
             | Getting joins wrong once in 1000 queries would beat 99.9%
             | of experienced data analysts.
             | 
             | Our standards for AI are too high.
             | 
             | If an autonomous car causes one wreck per ten million
             | miles, people set the cars on fire.
             | 
             | When someone finds an LLM that suggests eating a small rock
             | every day, that anecdote is used to discredit all LLM
             | results.
             | 
             | This shit makes errors. But what is the alternative? Human
             | analysts who get joins wrong four times in ten? Human
             | drivers who cause wrecks 30 times per ten million miles?
             | Human social media recommendations about nutritional
             | supplements?
        
               | saigal wrote:
               | The autonomous car analogy is a good one. The technology
               | is overall so far superior to a human (probably scrolling
               | TikTok) driving, but the moment it makes a mistake we
               | remove the AEV, even though it would be of higher
               | societal benefit.
               | 
               | Decisions should be made against an alternative, not
               | against some fictitious perfect solution.
        
         | hackernewds wrote:
         | This is the grievance I have as a data scientist. It is one of
         | the fields where things are technical, while meanwhile everyone
         | thinks they could do the job and provide excessive input and
         | exact direction.
        
           | saigal wrote:
           | there is a middle ground here. the most complicated queries
           | will need the intel and business context of a smart data
           | scientist. there are however so many types of queries where
           | automation would make the world so much easier and allow more
           | self-serve type data inquiries. too often the rhetoric around
           | these topics is binary as in "it works" or "it doesn't work."
           | in reality, there are certain use cases that work now and
           | others that don't yet.
        
       | _hzw wrote:
       | x
        
       | thedynamicduo wrote:
       | This looks really cool, can't wait to check it out. The problem
       | I've seen with other tools I've tinkered with is that they do
       | well with simple stuff like:
       | 
       | "what are my latest orders" -> select * from orders where
       | user_id=x order by created_date
       | 
       | But really struggle when you have a complex schema that requires
       | joins, and basically have no support when you are describing
       | something that needs outer joins or the like. Would be great to
       | hear if DataHerald has cracked that nut or if it's still a
       | challenge for you as well (no judgement if it is, it seems like a
       | hard problem).
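
A minimal sketch of why outer joins are easy to get wrong, assuming a hypothetical two-table schema (`users`, `orders`) that is not from the thread. The question "how many orders has each user placed, including users with none?" only comes out right with a LEFT OUTER JOIN; an inner join silently drops users without orders.

```python
import sqlite3

# Hypothetical schema and data, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 5.0);
""")

# "How many orders has each user placed, including users with none?"
# The LEFT OUTER JOIN keeps bob, who has no orders.
OUTER = """
    SELECT u.name, COUNT(o.id) AS n_orders
    FROM users u LEFT OUTER JOIN orders o ON o.user_id = u.id
    GROUP BY u.id, u.name ORDER BY u.name
"""
# The same query with an inner join: syntactically fine, semantically wrong.
INNER = OUTER.replace("LEFT OUTER JOIN", "JOIN")

outer_rows = conn.execute(OUTER).fetchall()
inner_rows = conn.execute(INNER).fetchall()
```

Both queries run without error, which is the point: only comparing the results against the intent of the question reveals that the inner-join version is missing a user.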
        
         | saigal wrote:
         | great question, and the one that we get the most :-) this is
         | precisely why we created Dataherald. Off the shelf LLMs can
         | handle a single table and simple questions. Dataherald's quest
         | is to ultimately provide enterprise-grade text to SQL, where
         | complex schema and joins are present. it does take some
         | training, but we've found that it can handle situations such as
         | the one you mention above.
        
           | BossingAround wrote:
           | Perhaps orthogonal problem - imagine you join a new company
           | that has an enterprise product with hundreds of tables. Is
           | there a way to connect Dataherald to my DB, and ask basic
           | questions about the DB? E.g. "where are stored records
           | related to X".
        
             | aazo11 wrote:
             | Yes when you connect Dataherald to a DB it scans it and you
             | can do exploratory queries.
        
               | momothereal wrote:
               | What happens when the tables and columns have cryptic
               | names/acronyms? Do you need to inject documentation?
        
           | roughly wrote:
           | So from the look of this, you're open sourcing all of the
           | agent/usage code, but I'm guessing you're keeping the trained
           | model in-house, so that becomes the value prop for Dataherald
           | the product - the trained LLM?
        
       | mholubowski wrote:
       | Hey! Why did you open source it? Genuinely curious.
        
       | tootie wrote:
       | Is the only LLM support OpenAI?
        
         | saigal wrote:
         | "The agent is LLM agnostic and you can use it with OpenAI or
         | self-hosted LLMs."
        
       | totalhack wrote:
       | Is this more like text-to-semantic layer or does it throw the
       | schema in the prompt and generate SQL with the llm?
        
         | aazo11 wrote:
         | This is not a text to semantic layer but it does far more than
         | just inject schema into the prompt:
         | 
         | - the engine keeps an updated catalog of the data (low
         |   cardinality columns, their values etc)
         | - taps into query history and finetunes the model to the
         |   schema
         | - allows uploading context from unstructured sources like docs
         |   and data dictionaries
         | - has an agent which collects all relevant info, generates the
         |   SQL, tries to retrieve a few rows to recover from errors and
         |   provides a confidence score to the generated SQL
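
The "retrieve a few rows to recover from errors" step could look roughly like the loop below. This is a sketch under stated assumptions, not Dataherald's implementation: `generate_sql` is a hypothetical stand-in for the LLM call, hard-coded to fail once so the retry path is visible, and SQLite stands in for the target database.

```python
import sqlite3

def generate_sql(question, error=None):
    """Hypothetical stand-in for the LLM call.

    The first attempt uses a misspelled column; once it is shown the
    database error it returns a corrected query.
    """
    if error is None:
        return "SELECT nme FROM users"
    return "SELECT name FROM users"

def answer(conn, question, max_attempts=3, sample_rows=3):
    """Generate SQL, probe it with a small LIMIT, and retry on database errors."""
    error = None
    for _ in range(max_attempts):
        sql = generate_sql(question, error)
        try:
            # Wrap the candidate so only a few sample rows are fetched.
            probe = f"SELECT * FROM ({sql}) LIMIT {sample_rows}"
            rows = conn.execute(probe).fetchall()
            return sql, rows
        except sqlite3.Error as exc:
            error = str(exc)  # fed back into the next generation attempt
    raise RuntimeError(f"no valid SQL after {max_attempts} attempts: {error}")

conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE users (name TEXT); INSERT INTO users VALUES ('ann'), ('bob');"
)
sql, rows = answer(conn, "list user names")
```

Executing the query only catches errors the database can detect (unknown columns, bad syntax). A query that runs but answers the wrong question still needs the separate confidence-scoring step mentioned above.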
        
       | winphone1974 wrote:
       | SQL is really close to a natural language that's unambiguous,
       | there's a few rough edges but it's not bad. Anything more natural
       | requires a lot of context and needs to solve ambiguity.
        
         | saigal wrote:
         | while i agree, there is clear demand for people to use natural
         | language to SQL. we have tremendous conviction around the
         | desire for natural language tools, but of course the technology
         | and product need to deliver desired results.
        
         | saigal wrote:
         | "Anything more natural requires a lot of context and needs to
         | solve ambiguity."
         | 
         | this is precisely why we created Dataherald-- to make it much
         | easier to add that business context so that NL to SQL could
         | actually be good enough to get into production
        
       | kwerk wrote:
       | Will it work with GraphQL?
        
         | aazo11 wrote:
         | Currently does not but looking to add support. Would love to
         | connect and learn more about your use case.
        
         | saigal wrote:
         | No
        
       | throwaway115 wrote:
       | What guarantees do you offer with query security if I turn this
       | over to an end user? How do I keep them only accessing their own
       | data?
        
         | freeone3000 wrote:
         | Any number of database namespacing techniques already present
         | in postgresql can prevent this. Link the user sign-on to a DB
         | user and you're gold.
        
           | throwaway115 wrote:
           | What? How does that ensure user 123 only generates LLM
           | queries that constrain on rows where user=123?
        
             | aazo11 wrote:
             | As I wrote on the original thread, we recommend using the
             | RDBMS row-level security features.
             | 
             | This blog discusses how to do that on Postgres
             | 
             | https://www.2ndquadrant.com/en/blog/application-users-vs-
             | row...
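
As a rough illustration of the row-level security recommendation, the helper below builds the Postgres statements for a per-user policy. The table name, column name, policy name and the `app.current_user_id` setting are all assumptions made for this sketch; nothing here comes from Dataherald's code or the linked blog post.

```python
def rls_statements(table, user_column, setting="app.current_user_id"):
    """Build Postgres DDL restricting `table` to rows owned by the session user.

    Identifiers are validated because they are interpolated into the SQL text.
    """
    for ident in (table, user_column):
        if not ident.isidentifier():
            raise ValueError(f"unsafe identifier: {ident!r}")
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY;",
        # current_setting() returns text, so the column is cast for comparison.
        f"CREATE POLICY {table}_owner_only ON {table} "
        f"USING ({user_column}::text = current_setting('{setting}'));",
    ]

def session_prelude(user_id, setting="app.current_user_id"):
    """Statement the application runs before executing any generated SQL."""
    return f"SET {setting} = '{int(user_id)}';"

statements = rls_statements("orders", "user_id")
prelude = session_prelude(123)
```

With a policy like this in place, generated SQL such as `SELECT * FROM orders` only sees the session user's rows regardless of what the LLM wrote. Note that in Postgres the table owner bypasses policies unless `FORCE ROW LEVEL SECURITY` is set, so the generated queries would need to run under a separate, non-owner role.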
        
               | altdataseller wrote:
               | Way way too complicated. I thought this tool was supposed
               | to make my life easier
        
               | saigal wrote:
               | is there an easier way?
        
               | altdataseller wrote:
               | Yes write SQL
        
               | saigal wrote:
               | The question was around row level security
        
         | aazo11 wrote:
         | We recommend users leverage row-level security features built
         | into modern RDBMS so the query results only return data for a
         | given user.
         | 
         | You can read more on how to do that on Postgres here
         | https://www.2ndquadrant.com/en/blog/application-users-vs-row...
        
           | throwaway115 wrote:
           | Where do you recommend this? It sounds dangerous for
           | databases that do not implement RLS, like MySQL, MariaDB,
           | SQLite. I think you should highlight that very clearly
           | somewhere.
        
       | RyanHamilton wrote:
       | >Would love to hear from the community on building natural
       | language interfaces to relational data.
       | 
       | I produce a free SQL editor that allows users to plug in OpenAI
       | to perform text to SQL:
       | https://www.timestored.com/qstudio/help/ai-text2sql
       | So far uptake is slow and the only good benefit is to spit out a
       | few queries as a starting point. The accuracy went up
       | significantly by sending schema and sample data but it sounds
       | like you've done a good job at going beyond that. I wouldn't say
       | my users or I are convinced it's the future but I'll certainly
       | look at your product tomorrow. Good work and congratulations.
        
         | saigal wrote:
         | Yes please do. We'd love your feedback and or to hear whether
         | you see material improvement over what you have now
        
       | fkm0r0ns wrote:
       | It's a great idea.
       | 
       | Of course, there will be some ambiguities, but maybe over time
       | you can somehow constrain the input language a bit, adding some
       | structure to it, such that you can query a database in English-
       | like syntax without any ambiguities.
       | 
       | That would be nice!
        
       | iandanforth wrote:
       | Am I misreading this code? It looks like you don't have
       | precomputed table representations and search, instead you scan,
       | embed, and compare on each run?
       | 
       | https://github.com/Dataherald/dataherald/blob/main/services/...
        
         | aazo11 wrote:
         | Tables, columns and views are scanned at configuration time (or
         | based on an API trigger) and stored in the data store and a
         | vector store, not on every run.
         | 
         | They are then retrieved and injected based on relevance to the
         | query.
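
The scan-once, retrieve-per-query pattern described here can be sketched as follows. This is an illustration only: the `SchemaIndex` class, the table descriptions and the toy bag-of-words "embedding" are assumptions, where a real deployment would call an embedding model and a vector store.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy bag-of-words vector; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class SchemaIndex:
    """Hypothetical index: tables are embedded once, then looked up per query."""

    def __init__(self):
        self._vectors = {}

    def scan(self, tables):
        """Configuration time: embed every table description exactly once."""
        self._vectors = {
            name: embed(f"{name} {description}")
            for name, description in tables.items()
        }

    def retrieve(self, question, k=2):
        """Query time: embed only the question and rank the stored vectors."""
        q = embed(question)
        ranked = sorted(
            self._vectors,
            key=lambda name: cosine(q, self._vectors[name]),
            reverse=True,
        )
        return ranked[:k]

index = SchemaIndex()
index.scan({
    "orders": "order id, customer id, order total, created date",
    "customers": "customer id, name, email, signup date",
    "web_logs": "request path, status code, latency",
})
top = index.retrieve("total order value by created date", k=1)
```

Only the question is embedded at query time; the per-table work happens in `scan`, which would be rerun when the schema changes or an API trigger fires.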
        
       | chenster wrote:
       | Thank you! This is exactly something we are looking for at
       | querro.io
        
         | saigal wrote:
         | fantastic. let us know how it goes :-)
        
         | saigal wrote:
         | https://discord.com/invite/A59Uxyy2k9
         | 
         | discord invite in case anything comes up
        
       | threeseed wrote:
       | Never understood why I would want to use this over an NLP+ORM
       | system.
       | 
       | At least with that you get 100% accuracy at the expense of having
       | to use a fixed syntax.
        
         | aazo11 wrote:
         | ORMs generally map around entities and dimensions. Users
         | generally ask about metrics and measures, which can be
         | expressed in aggregations and group bys.
         | 
         | How would the NLP+ORM system do this?
        
       | DeathArrow wrote:
       | I understand that this does better than the average LLM because
       | you can train it using the database structure. But since database
       | structures can change a lot, it might require retraining often.
       | 
       | Is retraining being done automatically after each PR that
       | modifies the DB? Is there a way to inject the DB structure in the
       | context?
        
       | zurfer wrote:
       | That's one of the more feature rich AI analytics assistants. (1)
       | 
       | Kudos for open sourcing. I think it's really difficult to build a
       | business around that, but there are some successful examples in
       | the space: metabase, airbyte, dbt, (maybe databricks?)
       | 
       | (1) https://github.com/Snowboard-Software/awesome-ai-analytics
        
       | evan_ry wrote:
       | This is a historical contribution. Thank you for doing it!
       | 
       | Basically all the enterprises with a lot of data need to "chat
       | with their data" right now.
       | 
       | I can't imagine how many teams are doing similar stuff right now.
        
       | iloveitaly wrote:
       | We open-sourced our text-to-sql product last year too (way more
       | simple than this):
       | 
       | https://github.com/ryanstout/question_to_sql
       | 
       | These sorts of businesses are really hard to build: incumbents
       | have such an advantage. Makes so much more sense for this to be
       | (a) open source (b) tied to snowflake / powerbi that have free
       | distribution and a good security story.
        
         | saigal wrote:
         | We also encounter a lot of build vs buy conversations with
         | businesses.
        
       | devd00d wrote:
       | I struggle to see why chatgpt can't do this already?
        
         | badgersnake wrote:
         | Then you haven't used it very much.
        
           | saigal wrote:
           | Yes totally agree. You can easily sniff out products that are
           | a simple wrap of GPT
        
             | yard2010 wrote:
             | Wouldn't a full featured OS GUI be a simple wrap of the
             | command line? Would this make it less valuable to have?
        
               | badgersnake wrote:
               | I think it would make it unusably slow.
        
       | jaynpatel wrote:
       | Perhaps orthogonal problem - imagine you join a new company that
       | has an enterprise product with hundreds of tables. Is there a way
       | to connect Dataherald to my DB, and ask basic questions about the
       | DB? E.g. "where are stored records related to X".
        
         | PartiallyTyped wrote:
         | Dump the schema, create a document for each table, use LLM with
         | rag?
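
The first two steps of that suggestion (dump the schema, one document per table) might look like this. It is a sketch that uses SQLite for the sake of a runnable example; the `table_documents` helper and the sample tables are hypothetical, and the retrieval and LLM steps are left out.

```python
import sqlite3

def table_documents(conn):
    """Return {table_name: text document} built from the live schema."""
    docs = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (name,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cols = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        col_text = ", ".join(f"{col[1]} {col[2]}" for col in cols)
        docs[name] = f"Table {name} with columns: {col_text}"
    return docs

# Hypothetical tables standing in for an unfamiliar enterprise schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
""")
docs = table_documents(conn)
```

Each document can then be embedded and retrieved for a question like "where are stored records related to X". With cryptic table or column names, the documents would likely need extra description text added by hand.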
        
       | zainhoda wrote:
       | Would you be interested in merging with Vanna in some way?
       | 
       | You're ahead of us in terms of interface but we're ahead of you
       | in terms of adoption (because of specific choices we've made and
       | partnerships we've done).
        
       | pamelafox wrote:
       | It looks like the supported vector DBs are Pinecone and Astra.
       | Have you looked into Postgres with pgvector? I've started
       | experimenting with building RAG flows for pgvector, works fairly
       | well.
        
         | aazo11 wrote:
         | Right now the supported vector stores are Chroma (which you can
         | self-host), Pinecone and Astra. Adding a new vector store is
         | quite easy: you just need to extend the VectorStore class
         | (https://github.com/Dataherald/dataherald/tree/main/services/...)
         | and set it as the vector store module to be used in the
         | environment variable
         | https://github.com/Dataherald/dataherald/blob/main/services/...
        
       | emmender2 wrote:
       | did Dataherald not find use cases or user problems to solve using
       | its tech?
       | 
       | are any startups applying LLMs profitable at all? or is it just
       | a mirage, i.e. in the real world, startups are not able to solve
       | users' problems well using LLMs?
        
       | npsimons wrote:
       | This is awesome! While I'm nowhere _near_ being able to leverage
       | this right now, I am currently going through the painful process
       | of "databasing" raw documents into SQL, and I can tell you that
       | perhaps the hardest part is getting the schema correct; as you
       | put it "natural language can often be ambiguous". Even worse, is
       | just the squishiness of things never originally intended to be
       | specified for software.
       | 
       | Communication always has been, and continues to be, the hardest
       | part of software development.
        
       | dhanushreddy29 wrote:
       | I built a similar thing with a streamlit interface for a
       | hackathon recently.
       | 
       | https://devpost.com/software/personal-sql-assistant
        
       ___________________________________________________________________
       (page generated 2024-05-25 23:00 UTC)