[HN Gopher] Fine-tuning GPT-3.5-turbo for natural language to SQL
___________________________________________________________________
Fine-tuning GPT-3.5-turbo for natural language to SQL
Author : saigal
Score : 100 points
Date : 2023-08-31 18:26 UTC (4 hours ago)
(HTM) web link (medium.com)
(TXT) w3m dump (medium.com)
| sandGorgon wrote:
| Do you not use a vector db and embeddings search to get the
| table structure, etc.?
| aazo11 wrote:
| Most RAG approaches use a vectorDB and embeddings for schema
| linking. In this case the fine-tuning is handling schema
| linking and there is no vectorDB.
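|
| For anyone curious, the RAG-style schema linking being compared
| against is roughly the following (a minimal sketch: embed() is
| a stand-in for whatever embedding model you use, the tables are
| made up, and a real vector DB turns the scoring loop into an
| index lookup):
|
|     import numpy as np
|
|     def embed(text: str) -> np.ndarray:
|         ...  # stand-in for your embedding model / vector DB
|
|     TABLE_DDL = {  # made-up schema snippets
|         "singer": "CREATE TABLE singer (id, name, country)",
|         "concert": "CREATE TABLE concert (id, singer_id, year)",
|     }
|
|     def top_tables(question: str, k: int = 2) -> list[str]:
|         # cosine similarity between the question and each DDL
|         q = embed(question)
|         sims = {}
|         for name, ddl in TABLE_DDL.items():
|             d = embed(ddl)
|             sims[name] = (q @ d) / (
|                 np.linalg.norm(q) * np.linalg.norm(d))
|         return sorted(sims, key=sims.get, reverse=True)[:k]
|
| Only the top tables' DDL goes into the prompt; the fine-tuned
| model has to do that linking on its own.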
| codeulike wrote:
| I often do complicated reports in SQL or complicated
| transformations in SQL for system migrations. To really write a
| query and 'get it right' you usually need insider knowledge that
| you can't glean from the column names.
|
| I see in their training set they've got comments about columns
| too. e.g. "Country" TEXT NOT NULL, - country
| where the singer born
|
| But that's still not enough.
|
| You also need a bunch of information about the real business that
| the data is describing. And you also need to analyse the whole
| database - is that field actually used? What are the common
| values for this picklist? What does that status actually mean in
| terms of business? If there are two of those rows that match the
| join but I want to avoid duplicates, which one do I take? - the
| newest, or the one with a certain status? etc.
| philipodonnell wrote:
| Are you planning on submitting this model to be evaluated against
| the Spider holdout set?
|
| Also, wondering if anyone has found research on the inverse of
| this approach to the problem, i.e., instead of training the model
| to understand the data, you improve the data to be more
| understandable to a model? This seems more promising when you are
| looking at enterprise use cases without much training data.
| Spider seems like quite a simple dataset compared to the ones I
| encounter on the job, and LLMs struggle even with those.
| aazo11 wrote:
| In theory one could create domain-specific (or industry-
| specific) templates for data. However, coming up with a
| universal structure might be challenging since data is so
| varied.
|
| Since the issue is often the context, plugging in data
| dictionaries (and passing those to the LLM) can help.
| MrezaPourreza wrote:
| Yes, we have already submitted the model for evaluation on the
| Spider holdout test set. While your suggestion is certainly
| intriguing, implementing a universal solution could be quite
| challenging, as it would heavily depend on the specifics of the
| dataset.
| philipodonnell wrote:
| I don't think it's necessarily about a "universal" solution,
| just "better". Make the column names more verbose, changing
| numeric enums to text ones, disambiguating column names, etc.
| One of the spider datasets is a stadium table and one of the
| column names is "average", which means average capacity, but
| it's super ambiguous. If you asked an LLM to "make these
| table columns more verbose" I bet it would call that
| "average_capacity" and all of the sudden some NLQ queries
| that confused the function and the column name would start to
| work.
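|
| Something like this is what I have in mind - just a sketch
| using the pre-1.0 openai client, with a simplified stadium
| table and a made-up prompt:
|
|     import openai  # pre-1.0 style client
|
|     DDL = "CREATE TABLE stadium (id INT, name TEXT, average INT)"
|
|     resp = openai.ChatCompletion.create(
|         model="gpt-3.5-turbo",
|         messages=[{
|             "role": "user",
|             "content": "Rewrite this DDL so every column name "
|                        "is verbose and unambiguous; keep the "
|                        "types unchanged:\n" + DDL,
|         }],
|     )
|     print(resp["choices"][0]["message"]["content"])
|     # hoped-for output: ..., average_capacity INT, ...
|
| Then you rerun the NLQ benchmark against the renamed schema
| and see how much of the gap closes.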
| tmostak wrote:
| It wasn't clear to me what evaluation method was being used, the
| chart in the blog says Execution Accuracy, but the numbers that
| seem to be used appear to correlate with "Exact Set Match"
| (comparing on SQL) instead of the "Execution With Values"
| (comparing on result set values). For example, DIN-SQL + GPT-4
| achieves an 85.3% "Execution With Values" score. Is that what is
| being used here?
|
| See the following for more info:
|
| https://yale-lily.github.io/spider
| https://github.com/taoyds/spider/tree/master/evaluation_exam...
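|
| For anyone who hasn't dug into the Spider scripts, the
| difference matters: exact set match compares the parsed SQL
| clauses, while execution accuracy runs both queries and
| compares the results - roughly like the sketch below
| (simplified; the official script normalizes values and handles
| ordering more carefully):
|
|     import sqlite3
|
|     def execution_match(db, gold_sql, pred_sql) -> bool:
|         # "Execution with values": compare result sets, not SQL
|         con = sqlite3.connect(db)
|         try:
|             gold = con.execute(gold_sql).fetchall()
|             pred = con.execute(pred_sql).fetchall()
|         except sqlite3.Error:
|             return False
|         finally:
|             con.close()
|         return sorted(gold, key=repr) == sorted(pred, key=repr)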
| bob1029 wrote:
| We've been chasing this rabbit since the beginning. It currently
| seems to be uncatchable for the use cases that would be most
| valuable to us - writing complex queries we've never seen before.
| Our use case seems to confound any notion of training or fine-
| tuning, since the cases we need the most help with are also the
| ones we have the fewest examples of.
|
| Instead of going for generative, few-shot models, I am starting
| to look at the other end of the spectrum: Binary classification
| into deterministic query building.
|
| With ChatGPT, you cannot realistically explain why you got some
| output in a way that anyone other than an AI/ML expert would find
| satisfying. With binary classifiers, you can precisely explain
| how some input resulted in some output in terms that a business
| person could easily digest - "You mentioned this table so it
| assumed you wanted to constrain on XYZ. Here's the trace from the
| classifiers...".
|
| I've proposed a few schemes where you define groups of
| classifiers for each SQL-building concern - which tables are
| involved, which columns, is a join or aggregate implied,
| general context of business use, etc. Clearly, there are
| holes in this scheme, but in our domain we could plausibly
| fine-tune our _humans_ to be a little bit more verbose in
| their use of the automagic SQL vending machine. Several hours
| spent training humans is probably a lot cheaper & easier than
| getting ChatGPT et al. to _consistently_ play by our rules.
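|
| A toy version of what I mean - the "classifiers" here are just
| keyword checks standing in for real trained models, and the
| schema is made up:
|
|     def mentions(question: str, table: str) -> bool:
|         # stand-in for a per-table binary classifier
|         return table in question.lower()
|
|     def wants_aggregate(question: str) -> bool:
|         # stand-in for an "is an aggregate implied?" classifier
|         q = question.lower()
|         return any(w in q for w in ("how many", "count", "total"))
|
|     def build_query(question: str):
|         trace = []
|         tables = [t for t in ("orders", "customers")
|                   if mentions(question, t)] or ["orders"]
|         trace.append(f"tables mentioned: {tables}")
|         agg = wants_aggregate(question)
|         trace.append(f"aggregate implied: {agg}")
|         sql = f"SELECT {'COUNT(*)' if agg else '*'} FROM {tables[0]}"
|         if len(tables) == 2:
|             sql += (" JOIN customers"
|                     " ON customers.id = orders.customer_id")
|             trace.append("joined customers to orders")
|         return sql, trace
|
| The trace is the part ChatGPT can't give you - every decision
| maps back to a specific classifier you can inspect and retrain.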
| Nick_Molter wrote:
| Curious to learn more about your use case. If fine-tuning is
| only ineffective for your most complex queries (and presumably
| those are less frequent as well, since you mentioned you have
| few examples), then couldn't you use fine-tuning to handle the
| simpler queries (presumably the lion's share) and thus free up
| excess man hours to focus on the more complex queries? Is there
| any benefit to AI being able to answer 90% of queries vs 0%?
| bob1029 wrote:
| These tools are already fantastic at our 80% average case,
| even without fine tuning. We are seeing _some_ value, but the
| real pain is in that other 20%.
| sillysaurusx wrote:
| I think you've fallen into the trap of "AIs don't generalize,
| they memorize." But they do in fact generalize. The reason
| ChatGPT is so valuable is precisely because it can help out
| with situations that have never been seen before, not because
| it merely unlocks old preexisting knowledge. The fella who
| saved their dog with ChatGPT comes to mind.
| https://nypost.com/2023/03/27/chatgpt-saved-my-dogs-life-aft...
| bob1029 wrote:
| > I think you've fallen into the trap of "AIs don't
| generalize, they memorize."
|
| Binary classifiers don't generalize?
|
| Just because my _output_ is not generative does not mean we
| cannot learn / generalize elsewhere. Think of it as a 2-stage
| process.
| h1fra wrote:
| The cost per question seems super high. I can't even think of
| an API where a single call would cost 1 cent. You'd better
| have a good pricing model to follow up on this.
| claytonjy wrote:
| Google Speech-to-Text is one I've dealt with recently; costs
| over 1¢ to transcribe 1 minute of audio. Almost 8¢/minute for
| the medical version.
| h1fra wrote:
| 1C/ per 1minute is fair as it would take the best translator
| much more time than that. 1C/ per sql query seems less fair
| to me, but we could argue it would also cost more to ask a
| dev.
| eli wrote:
| It would obviously cost much more to pay a person to write
| SQL for you
| riku_iki wrote:
| > I can't even think of an API where a single call would cost
| $1cent.
|
| paying an engineer for the same job will be many times more
| expensive.
| Zetobal wrote:
| If your app is able to function better for the cost of the API
| calls the value might just be there.
| beebmam wrote:
| Really? I sure can. I can even imagine a single API request
| that triggers a job that costs $100,000. These APIs exist.
| They're not for individuals.
| h1fra wrote:
| I mean yes, I have managed BigQuery jobs that cost thousand
| of dollars per run but the actual call is not the thing that
| is expensive and I'm in control of what it costs me. It's not
| exactly the same scenario imo
| ainesh93 wrote:
| I think the focus here isn't necessarily on compute cost.
| When companies hire data scientists or analysts, they're
| niche-skilled and expensive. If those people spend 50-60% of
| their time fielding ad-hoc questions from various people in
| the org, the cost of that employee's time (and the money
| spent on them doing menial tasks that are a waste of their
| skillset) is the biggest factor.
| IanCal wrote:
| It's about 0.2 cents right?
|
| It's hard for something you interact with manually to provide
| positive value less than 0.2c.
| thewataccount wrote:
| That's for the non-finetuned GPT3.5turbo model.
|
| Finetuned is 1.2cents/1k in and 1.6cents/1k out. So it'll
| likely be closer to 2cents depending on what you're doing.
|
| I'm not saying it's not useful, at 2c per query you have to
| be more "purposeful" as they could certainly add up depending
| on how you use it compared to 0.2c.
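|
| Back-of-envelope, assuming a ~1,500-token prompt (schema plus
| question) and ~100 tokens of SQL back - the token counts are
| just a guess:
|
|     prompt_toks, output_toks = 1500, 100  # assumed sizes
|     cents = prompt_toks / 1000 * 1.2 + output_toks / 1000 * 1.6
|     print(cents)  # ~1.96 cents per query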
| IanCal wrote:
| Ah thanks I read that wrong.
| Legend2440 wrote:
| What? Some of the APIs I use at work cost $100+ per call. (it's
| not really about the call, it's about the data it provides)
| shireboy wrote:
| I think natural language to SQL may not be as great as it sounds
| in many real-world applications. You often have permissions,
| tenants, rules about who-can-see-what. I love the idea of letting
| users use natural language to query and possibly even bulk-update
| data. But if the app has rules like those, this would just be a
| SQL injection vulnerability. You could possibly limit it - just
| allow query over certain views or something - but the risk of the
| GPT crafting a dangerous query is still high.
|
| What would be more useful IMO is natural language to OData,
| GraphQL, and OpenAPI/Swagger. Then you could let users do ad-hoc
| query but only against data they are allowed. I did a PoC using
| GPT3 to query OData and it was pretty fun, but did occasionally
| get the query wrong. I also ran into the context window issue. It
| would get lost when fed larger OData schema.
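|
| To be concrete about the "only allow queries over certain
| views" idea, the check might look something like this - a
| very rough sketch, not real protection; you'd still want a
| proper SQL parser plus a read-only, row-level-security-scoped
| database role underneath it:
|
|     import re
|
|     ALLOWED = {"v_orders_mine", "v_products_public"}  # views
|
|     def is_allowed(sql: str) -> bool:
|         s = sql.strip().rstrip(";")
|         if ";" in s or not s.lower().startswith("select"):
|             return False  # single SELECT statements only
|         refs = re.findall(r"\b(?:from|join)\s+(\w+)",
|                           s, re.IGNORECASE)
|         return {r.lower() for r in refs} <= ALLOWED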
| MrezaPourreza wrote:
| I believe this article underscores the significance and efficacy
| of fine-tuning for specific tasks. Looking ahead, I envision the
| integration of fine-tuned models with RAG agents and models,
| further enhancing overall performance.
| saigal wrote:
| A tutorial on how to fine-tune a GPT3.5 model for Natural
| Language to SQL tasks and a comparison of its performance vs
| Retrieval Augmented Generation. Based on the results, fine-tuning
| can match and outperform RAG (the approach matches the state of
| the art on accuracy while being far faster and cheaper). The big
| challenge for fine-tuning tasks like this is building the
| training datasets. Things should get even more interesting when
| GPT-4 is opened for fine-tuning.
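|
| For anyone wondering what the training data looks like: each
| example is one JSONL line holding a short chat transcript,
| roughly like this (the schema snippet and question are
| illustrative, not copied from the actual training set):
|
|     import json
|
|     example = {"messages": [
|         {"role": "system",
|          "content": "Given the schema, reply with one SQL "
|                     "query.\nCREATE TABLE singer (..., "
|                     "\"Country\" TEXT NOT NULL, ...)"},
|         {"role": "user",
|          "content": "How many singers are from France?"},
|         {"role": "assistant",
|          "content": "SELECT COUNT(*) FROM singer "
|                     "WHERE Country = 'France'"},
|     ]}
|
|     with open("train.jsonl", "a") as f:
|         f.write(json.dumps(example) + "\n")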
| [deleted]
| iandanforth wrote:
| Spider isn't anything like the queries that analysts and data
| scientists write against DBs. I don't think it even has many
| joins. Do you not have access to a more realistic training set?
| ainesh93 wrote:
| Although Spider is better known in the text-to-SQL world,
| you're right that BiRD may provide a better testing ground.
| Comparing against the current leaderboard on that standard is
| on the docket!
| philipodonnell wrote:
| Agree that it's not representative of real world queries, but I
| think it's more of a "measure progress against a simple but
| consistent baseline" and SOTA is still struggling so it's
| clearly not ready for real queries. Most of the papers on this
| topic mention a lack of large volumes of high quality training
| data... Spider is probably the best public one right now.
| tmostak wrote:
| I agree that Spider queries are not necessarily representative
| of the SQL you might see in the wild from real users, but
| looking at some analysis I did of the dataset around 43% of the
| queries had joins, and a number had 3, 4, or 5-way joins.
___________________________________________________________________
(page generated 2023-08-31 23:00 UTC)