[HN Gopher] Fine-tuning GPT-3.5-turbo for natural language to SQL
       ___________________________________________________________________
        
       Fine-tuning GPT-3.5-turbo for natural language to SQL
        
       Author : saigal
       Score  : 100 points
       Date   : 2023-08-31 18:26 UTC (4 hours ago)
        
 (HTM) web link (medium.com)
 (TXT) w3m dump (medium.com)
        
       | sandGorgon wrote:
        | do you not use a vector DB and embeddings search to get the
        | table structure, etc.?
        
         | aazo11 wrote:
          | Most RAG approaches use a vector DB and embeddings for schema
          | linking. In this case, the fine-tuning handles schema linking
          | and there is no vector DB.
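          | 
          | For reference, a minimal sketch of what embedding-based schema
          | linking looks like on the RAG side (2023-era openai Python
          | client; the table DDLs and the question are made up):
          | 
          |     import numpy as np
          |     import openai  # pip install openai==0.28
          |     
          |     def embed(text):
          |         # text-embedding-ada-002 was the usual 2023 choice
          |         r = openai.Embedding.create(
          |             model="text-embedding-ada-002", input=text)
          |         return np.array(r["data"][0]["embedding"])
          |     
          |     # Hypothetical schema: one DDL string per table
          |     tables = {
          |         "singer": 'CREATE TABLE singer ("Name" TEXT,'
          |                   ' "Country" TEXT)',
          |         "stadium": 'CREATE TABLE stadium ("Name" TEXT,'
          |                    ' "Capacity" INT)',
          |     }
          |     vecs = {t: embed(ddl) for t, ddl in tables.items()}
          |     
          |     def cos(a, b):
          |         na, nb = np.linalg.norm(a), np.linalg.norm(b)
          |         return float(a @ b) / (na * nb)
          |     
          |     # Rank tables by similarity to the question, keep top-k
          |     q = embed("How many singers were born in France?")
          |     ranked = sorted(tables, key=lambda t: cos(q, vecs[t]),
          |                     reverse=True)
          |     print(ranked[:1])  # expect ['singer']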
        
       | codeulike wrote:
       | I often do complicated reports in SQL or complicated
       | transformations in SQL for system migrations. To really write a
       | query and 'get it right' you usually need insider knowledge that
       | you can't glean from the column names.
       | 
        | I see in their training set they've got comments about columns
        | too, e.g.:
        | 
        |     "Country" TEXT NOT NULL, -- country where the singer born
        | 
        | But that's still not enough.
       | 
        | You also need a bunch of information about the real business that
        | the data is describing. And you also need to analyse the whole
        | database - is that field actually used? What are the common
        | values for this picklist? What does that status actually mean in
        | business terms? If there are two of those rows that match the
        | join but I want to avoid duplicates, which one do I take - the
        | newest, or the one with a certain status? etc.
        
       | philipodonnell wrote:
       | Are you planning on submitting this model to be evaluated against
       | the Spider holdout set?
       | 
        | Also, I'm wondering if anyone has found research on the inverse
        | approach to the problem: instead of training the model to
        | understand the data, you improve the data to be more
        | understandable to a model. This seems more promising when you
        | are looking at enterprise use cases without much training data.
        | Spider seems like quite a simple dataset compared to the ones I
        | encounter on the job, and LLMs struggle even with those.
        
         | aazo11 wrote:
          | In theory one could create domain-specific (or industry-
          | specific) templates for data. However, coming up with a
          | universal structure might be challenging since data is so
          | varied.
          | 
          | Since the issue is often the context, plugging in data
          | dictionaries (and passing those to the LLM) can help.
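          | 
          | A rough sketch of what plugging in a data dictionary might
          | look like (the column names and descriptions here are made
          | up):
          | 
          |     # Hypothetical data dictionary: column -> meaning
          |     data_dict = {
          |         "singer.Country": "country where the singer was born",
          |         "singer.Age": "age when the row was loaded",
          |     }
          |     
          |     schema = 'CREATE TABLE singer ("Country" TEXT, "Age" INT)'
          |     notes = "\n".join(f"-- {col}: {desc}"
          |                       for col, desc in data_dict.items())
          |     
          |     # The dictionary rides along with the schema in the prompt
          |     prompt = ("Given the schema and column notes, write SQL.\n"
          |               f"{schema}\n{notes}\n"
          |               "Question: How many singers were born in France?")
          |     print(prompt)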
        
         | MrezaPourreza wrote:
         | Yes, we have already submitted the model for evaluation on the
         | Spider holdout test set. While your suggestion is certainly
         | intriguing, implementing a universal solution could be quite
         | challenging, as it would heavily depend on the specifics of the
         | dataset.
        
           | philipodonnell wrote:
            | I don't think it's necessarily about a "universal" solution,
            | just "better": making the column names more verbose, changing
            | numeric enums to text ones, disambiguating column names, etc.
            | One of the Spider datasets has a stadium table where one of
            | the column names is "average", which means average capacity,
            | but it's super ambiguous. If you asked an LLM to "make these
            | table columns more verbose" I bet it would call that
            | "average_capacity", and all of a sudden some NLQ queries that
            | confused the AVG function with the column name would start to
            | work.
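            | 
            | Something along these lines, maybe (2023-era openai client;
            | the prompt wording is just a guess):
            | 
            |     import openai  # pip install openai==0.28
            |     
            |     ddl = 'CREATE TABLE stadium ("Name" TEXT, "average" INT)'
            |     resp = openai.ChatCompletion.create(
            |         model="gpt-3.5-turbo",
            |         messages=[{
            |             "role": "user",
            |             "content": "Make these table columns more "
            |                        "verbose and unambiguous; return "
            |                        "the DDL:\n" + ddl,
            |         }],
            |     )
            |     print(resp["choices"][0]["message"]["content"])
            |     # hopefully: ... "average_capacity" INT ...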
        
       | tmostak wrote:
        | It wasn't clear to me what evaluation method was being used. The
        | chart in the blog says Execution Accuracy, but the numbers that
        | seem to be used appear to correlate with "Exact Set Match"
        | (comparing on SQL) rather than "Execution With Values"
        | (comparing on result-set values). For example, DIN-SQL + GPT-4
        | achieves an 85.3% "Execution With Values" score. Is that what is
        | being used here?
       | 
       | See the following for more info:
       | 
       | https://yale-lily.github.io/spider
       | https://github.com/taoyds/spider/tree/master/evaluation_exam...
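        | 
        | For anyone unfamiliar: "Execution With Values" roughly means
        | running both queries and comparing the returned rows, rather
        | than comparing SQL text. A minimal sketch with sqlite and toy
        | data:
        | 
        |     import sqlite3
        |     
        |     con = sqlite3.connect(":memory:")
        |     con.execute('CREATE TABLE singer '
        |                 '("Name" TEXT, "Country" TEXT)')
        |     con.execute("INSERT INTO singer VALUES ('A', 'France')")
        |     
        |     gold = "SELECT count(*) FROM singer WHERE Country='France'"
        |     pred = "SELECT count(Name) FROM singer WHERE Country='France'"
        |     
        |     # Exact Set Match on SQL text would call these different;
        |     # execution accuracy compares the result sets instead.
        |     def rows(sql):
        |         return set(map(tuple, con.execute(sql).fetchall()))
        |     
        |     print(rows(gold) == rows(pred))  # True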
        
       | bob1029 wrote:
       | We've been chasing this rabbit since the beginning. It currently
       | seems to be uncatchable for the use cases that would be most
       | valuable to us - writing complex queries we've never seen before.
       | Our use case seems to confound any notion of training or fine-
       | tuning, since the cases we need the most help with are also the
       | ones we have the fewest examples of.
       | 
       | Instead of going for generative, few-shot models, I am starting
       | to look at the other end of the spectrum: Binary classification
       | into deterministic query building.
       | 
       | With ChatGPT, you cannot realistically explain why you got some
       | output in a way that anyone other than an AI/ML expert would find
       | satisfying. With binary classifiers, you can precisely explain
       | how some input resulted in some output in terms that a business
       | person could easily digest - "You mentioned this table so it
       | assumed you wanted to constrain on XYZ. Here's the trace from the
       | classifiers...".
       | 
       | I've proposed a few schemes where you define groups of
       | classifiers for each SQL building concern - Which tables are
       | involved, which columns, is a join or aggregate implied, general
       | context of business use, etc. Clearly, there are holes with this
       | scheme, but in our domain we could plausibly fine-tune our
       | _humans_ to be a little bit more verbose in their use of the
       | automagic SQL vending machine. Several hours spent training
        | humans is probably a lot cheaper & easier than getting ChatGPT
        | et al. to _consistently_ play by our rules.
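        | 
        | As a very rough sketch of the shape I have in mind (the
        | classifiers below are keyword stand-ins for real trained
        | models):
        | 
        |     # One binary classifier per SQL-building concern; each
        |     # returns (decision, evidence) so the trace is explainable.
        |     def mentions_orders(q):
        |         hit = "order" in q.lower()
        |         return hit, "matched 'order'" if hit else "no match"
        |     
        |     def implies_aggregate(q):
        |         hit = any(w in q.lower()
        |                   for w in ("how many", "total"))
        |         return hit, "count phrasing" if hit else "no match"
        |     
        |     def build_query(q):
        |         tab, agg = mentions_orders(q), implies_aggregate(q)
        |         trace = [("table:orders", *tab), ("aggregate", *agg)]
        |         select = "count(*)" if agg[0] else "*"
        |         sql = f"SELECT {select} FROM orders" if tab[0] else None
        |         return sql, trace
        |     
        |     sql, trace = build_query("How many orders shipped?")
        |     print(sql)  # SELECT count(*) FROM orders
        |     print(trace)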
        
         | Nick_Molter wrote:
         | Curious to learn more about your use case. If fine-tuning is
         | only ineffective for your most complex queries (and presumably
         | those are less frequent as well, since you mentioned you have
         | few examples), then couldn't you use fine-tuning to handle the
          | simpler queries (presumably the lion's share) and thus free up
          | man-hours to focus on the more complex queries? Is there any
          | benefit to AI being able to answer 90% of queries vs. 0%?
        
           | bob1029 wrote:
           | These tools are already fantastic at our 80% average case,
            | even without fine-tuning. We are seeing _some_ value, but the
           | real pain is in that other 20%.
        
         | sillysaurusx wrote:
         | I think you've fallen into the trap of "AIs don't generalize,
         | they memorize." But they do in fact generalize. The reason
         | ChatGPT is so valuable is precisely because it can help out
         | with situations that have never been seen before, not because
         | it merely unlocks old preexisting knowledge. The fella who
         | saved their dog with ChatGPT comes to mind.
         | https://nypost.com/2023/03/27/chatgpt-saved-my-dogs-life-aft...
        
           | bob1029 wrote:
           | > I think you've fallen into the trap of "AIs don't
           | generalize, they memorize."
           | 
           | Binary classifiers don't generalize?
           | 
            | Just because my _output_ is not generative does not mean we
            | cannot learn / generalize elsewhere. Think of it as a 2-stage
            | process.
        
       | h1fra wrote:
        | The cost per question seems super high. I can't even think of an
        | API where a single call would cost 1 cent. You'd better have a
        | good pricing model to follow up on this.
        
         | claytonjy wrote:
          | Google Speech-to-Text is one I've dealt with recently; costs
          | over 1¢ to transcribe 1 minute of audio. Almost 8¢/minute for
          | the medical version.
        
           | h1fra wrote:
            | 1¢ per minute is fair, as it would take the best translator
            | much more time than that. 1¢ per SQL query seems less fair
            | to me, but we could argue it would also cost more to ask a
            | dev.
        
             | eli wrote:
             | It would obviously cost much more to pay a person to write
             | SQL for you
        
         | riku_iki wrote:
          | > I can't even think of an API where a single call would cost
          | 1 cent.
          | 
          | Paying an engineer for the same job will be many times more
          | expensive.
        
         | Zetobal wrote:
          | If your app is able to function better for the cost of the API
          | calls, the value might just be there.
        
         | beebmam wrote:
         | Really? I sure can. I can even imagine a single API request
         | that triggers a job that costs $100,000. These APIs exist.
         | They're not for individuals.
        
           | h1fra wrote:
            | I mean yes, I have managed BigQuery jobs that cost thousands
            | of dollars per run, but the actual call is not the thing that
            | is expensive, and I'm in control of what it costs me. It's
            | not exactly the same scenario imo.
        
             | ainesh93 wrote:
              | I think the focus here isn't necessarily on compute cost.
              | When companies hire data scientists or analysts, they're
              | niche-skilled and expensive. If those people spend 50-60%
              | of their time fielding ad-hoc questions from various
              | people in the org, the cost of that employee's time (and
              | the money spent on them doing menial tasks that are a
              | waste of their skillset) is the biggest factor.
        
         | IanCal wrote:
          | It's about 0.2 cents, right?
          | 
          | It's hard for something you interact with manually to be worth
          | less than 0.2c per use.
        
           | thewataccount wrote:
            | That's for the non-fine-tuned GPT-3.5-turbo model.
            | 
            | Fine-tuned is 1.2 cents/1K tokens in and 1.6 cents/1K tokens
            | out, so it'll likely be closer to 2 cents depending on what
            | you're doing.
            | 
            | I'm not saying it's not useful, but at 2c per query you have
            | to be more "purposeful", as the costs could certainly add up
            | depending on how you use it, compared to 0.2c.
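            | 
            | Back-of-the-envelope, assuming a ~1.5k-token schema prompt
            | and a ~100-token completion:
            | 
            |     # fine-tuned gpt-3.5-turbo prices quoted above,
            |     # in cents per 1K tokens; sizes are assumptions
            |     in_c, out_c = 1.2, 1.6
            |     n_in, n_out = 1500, 100
            |     cost = in_c * n_in / 1000 + out_c * n_out / 1000
            |     print(f"{cost:.2f} cents/query")  # 1.96 cents/query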
        
             | IanCal wrote:
             | Ah thanks I read that wrong.
        
         | Legend2440 wrote:
         | What? Some of the APIs I use at work cost $100+ per call. (it's
         | not really about the call, it's about the data it provides)
        
       | shireboy wrote:
        | I think natural language to SQL may not be as great as it sounds
        | in many real-world applications. You often have permissions,
        | tenants, rules about who can see what. I love the idea of
        | letting users use natural language to query and possibly even
        | bulk-update data. But if the app has rules like those, this
        | would just be a SQL injection vulnerability. You could possibly
        | limit it - just allow queries over certain views or something -
        | but the risk of the GPT crafting a dangerous query is still
        | high.
        | 
        | What would be more useful IMO is natural language to OData,
        | GraphQL, and OpenAPI/Swagger. Then you could let users do ad-hoc
        | queries, but only against data they're allowed to see. I did a
        | PoC using GPT-3 to query OData and it was pretty fun, but it did
        | occasionally get the query wrong. I also ran into the context
        | window issue: it would get lost when fed a larger OData schema.
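        | 
        | One partial mitigation is to parse the generated SQL and reject
        | anything that isn't a plain SELECT over a whitelist of
        | per-tenant views, roughly like this (sqlglot is one way to do
        | the parsing; the view names are made up):
        | 
        |     from sqlglot import exp, parse_one  # pip install sqlglot
        |     
        |     ALLOWED = {"tenant_sales_v", "tenant_users_v"}
        |     
        |     def is_safe(sql):
        |         tree = parse_one(sql)
        |         if not isinstance(tree, exp.Select):
        |             return False  # no DML/DDL, SELECT only
        |         tables = {t.name for t in tree.find_all(exp.Table)}
        |         return tables <= ALLOWED
        |     
        |     print(is_safe("SELECT * FROM tenant_sales_v"))  # True
        |     print(is_safe("DELETE FROM tenant_users_v"))    # False
        |     print(is_safe("SELECT * FROM payroll"))         # False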
        
       | MrezaPourreza wrote:
       | I believe this article underscores the significance and efficacy
       | of fine-tuning for specific tasks. Looking ahead, I envision the
       | integration of fine-tuned models with RAG agents and models,
       | further enhancing overall performance.
        
       | saigal wrote:
        | A tutorial on how to fine-tune a GPT-3.5 model for natural
        | language to SQL tasks, and a comparison of its performance vs.
        | Retrieval Augmented Generation. Based on the results,
        | fine-tuning can match and even outperform RAG (the approach
        | matches the state of the art on accuracy while being far faster
        | and cheaper). The big challenge for fine-tuning tasks like this
        | is building the training datasets. Things should get even more
        | interesting when GPT-4 is opened up for fine-tuning.
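        | 
        | For the curious, the training data is chat-formatted JSONL (one
        | example per line), and kicking off a job is a couple of calls.
        | A sketch with the 2023-era openai client; the schema, question,
        | and file name are placeholders:
        | 
        |     import json, openai  # pip install openai==0.28
        |     
        |     example = {"messages": [
        |         {"role": "system",
        |          "content": "Translate the question to SQLite SQL "
        |                     "for the given schema."},
        |         {"role": "user",
        |          "content": "Schema: CREATE TABLE singer "
        |                     '("Country" TEXT)\n'
        |                     "Question: How many singers were born "
        |                     "in France?"},
        |         {"role": "assistant",
        |          "content": "SELECT count(*) FROM singer "
        |                     "WHERE Country = 'France'"},
        |     ]}
        |     with open("train.jsonl", "w") as f:
        |         f.write(json.dumps(example) + "\n")
        |     
        |     up = openai.File.create(file=open("train.jsonl", "rb"),
        |                             purpose="fine-tune")
        |     job = openai.FineTuningJob.create(training_file=up["id"],
        |                                       model="gpt-3.5-turbo")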
        
         | [deleted]
        
       | iandanforth wrote:
       | Spider isn't anything like the queries that analysts and data
       | scientists write against DBs. I don't think it even has many
       | joins. Do you not have access to a more realistic training set?
        
         | ainesh93 wrote:
          | Although Spider is better known in the text-to-SQL world,
          | you're right that BIRD may provide a better testing ground.
          | Comparing against the current leaderboard on that benchmark is
          | on the docket!
        
         | philipodonnell wrote:
          | Agree that it's not representative of real-world queries, but I
          | think it's more of a "measure progress against a simple but
          | consistent baseline" benchmark, and SOTA is still struggling,
          | so it's clearly not ready for real queries. Most of the papers
          | on this topic mention a lack of large volumes of high-quality
          | training data... Spider is probably the best public one right
          | now.
        
         | tmostak wrote:
          | I agree that Spider queries are not necessarily representative
          | of the SQL you might see in the wild from real users, but in
          | some analysis I did of the dataset, around 43% of the queries
          | had joins, and a number had 3-, 4-, or 5-way joins.
        
       ___________________________________________________________________
       (page generated 2023-08-31 23:00 UTC)