https://medium.com/dataherald/fine-tuning-gpt-3-5-turbo-for-natural-language-to-sql-4445c1d37f7c

Fine-tuning GPT-3.5-Turbo for Natural Language to SQL

Mo Pourreza, published in Dataherald

Background

Allowing non-technical users to ask questions of a database has been a problem of interest in academia and industry for years. Recent advances in Large Language Model (LLM) technology, such as GPT-4, have improved the accuracy of proposed solutions. However, since the most advanced LLMs have not been open for fine-tuning, recent work in the space has focused on creating Retrieval-Augmented Generation (RAG) algorithms that can enable complex Natural Language to SQL (NL-to-SQL) scenarios without modifying the underlying LLM.

Last week, OpenAI opened up GPT-3.5-Turbo for fine-tuning. In this post, we will fine-tune our own NL-to-SQL model and compare its performance against the state-of-the-art RAG approach. We will use the Spider dataset from Yale University as our test benchmark.

Fine-tuning GPT-3.5-Turbo for NL-to-SQL

Like all model training and fine-tuning, the first step of fine-tuning GPT-3.5-Turbo is the creation and upload of a training dataset. Since GPT-3.5-Turbo is a chat model, this dataset must use the following format and be uploaded as a JSONL file:

{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}
{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}
{"messages": [{"role": "system", "content": "system_prompt"}, {"role": "user", "content": "user_prompt"}, {"role": "assistant", "content": "assistant_prompt"}]}

The Spider dataset has a holdout test set of 2,147 question/SQL pairs, a development set of 1,034 question/SQL pairs, and a training set of 7,000 question/SQL pairs. We will build our fine-tuning dataset in the structure above from the Spider training set.

Creating the training dataset

An NL-to-SQL task is defined as follows: given a question and a database, generate a SQL query that, when executed against the database, returns a result set that answers the question. Various approaches have been explored for how best to prompt LLMs for this task, and it is generally agreed that the prompt needs to include an instructional component, details of the database schema, information about the database's content, a set of task-specific demonstrations and, of course, the actual question at hand.

Given the format of the chat model training data, the elements above have to be presented within the following three prompts:

* system_prompt -- will contain the instruction, database schema and database content
* user_prompt -- will contain the natural language question
* assistant_prompt -- where the SQL will be provided together with a reasoning step

Let's look at how to create each of these for our NL-to-SQL training dataset.
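Before looking at each prompt in detail, here is a minimal sketch (our own, not code from the post) of how the three prompts for a Spider example can be packaged into the JSONL structure shown above; to_training_record and write_jsonl are hypothetical helper names.

import json

def to_training_record(system_prompt: str, question: str, assistant_prompt: str) -> dict:
    # One chat-format fine-tuning record, matching the JSONL structure above.
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
            {"role": "assistant", "content": assistant_prompt},
        ]
    }

def write_jsonl(records: list, path: str = "spider-finetuning.jsonl") -> None:
    # Write one JSON object per line, as required for the fine-tuning upload.
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")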
The system prompt

Creating the system_prompt is by far the most complex part of this exercise. At a minimum, the system_prompt needs to include:

1. The system instruction
2. The DB schema
3. Information about the DB content

In addition, for any real-world use case with a large number of tables, the samples in the training set should also train the model to select the correct tables from the DB for the SQL query (i.e. perform schema linking).

System Instruction

For the instruction we used the following standard prompt:

You are an assistant that is an expert in generating Sqlite SQL queries. Having the access to database content, generate a correct Sqlite SQL query for the given question. ### Database content ###

Database Schema

In the literature there are many proposed prompt formats for the database schema and content, with no clear consensus around which performs best. We found the following to be the optimal representation of the database schema:

CREATE TABLE concert (
    "concert_ID" INTEGER NOT NULL,
    "concert_Name" TEXT NOT NULL, -- the name of the concert
    "Theme" TEXT, -- theme of the concert
    "Stadium_ID" TEXT NOT NULL,
    "Year" TEXT,
    PRIMARY KEY ("concert_ID"),
    FOREIGN KEY("Stadium_ID") REFERENCES stadium ("Stadium_ID")
)

CREATE TABLE singer (
    "Singer_ID" INTEGER NOT NULL,
    "Name" TEXT, -- name of the singer
    "Country" TEXT NOT NULL, -- country where the singer was born
    "Song_Name" TEXT NOT NULL, -- the name of the song produced by the singer
    "Song_release_year" TEXT, -- the release year of the song
    "Age" INTEGER,
    "Is_male" BOOLEAN NOT NULL,
    PRIMARY KEY ("Singer_ID")
)

Database Content

After much experimentation we found the following template to perform the best at teaching the model about the database content:

/*
Columns in concert and 3 examples in each column for the high cardinality columns :
concert_ID : 1025, 1101, 1247
concert_Name : "Fire", "Dance", "Sky"
Stadium_ID : 9, 10, 11
*/

/*
Columns in concert and all categories for the low cardinality columns :
Theme : "ROCK", "POP", "HIP-HOP"
Year : 2022, 2021, 2023, 2020
*/

/*
Columns in singer and 3 examples in each column for the high cardinality columns :
Singer_ID : 10235, 110231, 1242447
Name : "Jordan", "Gabriel", "Tiffany"
Country : "Iran", "India", "Canada"
Song_Name : "dance in the fire", "rain", "sky"
Age : 19, 20, 21
*/

/*
Columns in singer and all categories for the low cardinality columns :
Is_male : "MALE", "FEMALE"
Song_release_year : 2022, 2021, 2023, 2020
*/

An important element in the database content is how to identify categorical (low cardinality) columns. The threshold for distinguishing between low and high cardinality columns depends on the context window size of the LLM being fine-tuned. Given the 4,096-token context window of GPT-3.5-Turbo, we determined 20 tokens to be the appropriate threshold between low and high cardinality columns.

Schema Linking

The final challenge in creating the system_prompts for our training set is to provide samples in such a way that the model is trained to correctly perform schema linking on the database. To do this, we employed the following heuristic: for each individual NL <> SQL sample, we included a random selection of other tables from the DB in addition to the correct tables, until we reached the context window limit of 4,000 tokens. To mitigate the influence of positional information, we further randomized the order of tables. In short, each system_prompt included the schema and content of the relevant tables mixed in with other irrelevant tables, helping train the model to pick the correct tables for the query.
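The post does not include code for this heuristic, but a minimal sketch could look like the following. Here build_system_prompt, schema_and_content_for and count_tokens are hypothetical names: schema_and_content_for renders a table's CREATE TABLE block plus content comments, and count_tokens counts tokens (e.g. with tiktoken).

import random

TOKEN_BUDGET = 4000  # context window limit used for the training prompts

INSTRUCTION = (
    "You are an assistant that is an expert in generating Sqlite SQL queries. "
    "Having the access to database content, generate a correct Sqlite SQL query "
    "for the given question. ### Database content ###"
)

def build_system_prompt(gold_tables, all_tables, schema_and_content_for, count_tokens):
    # Start with the tables the gold SQL actually references.
    selected = list(gold_tables)
    used = count_tokens(INSTRUCTION) + sum(count_tokens(schema_and_content_for(t)) for t in selected)

    # Add randomly chosen irrelevant tables until the token budget is exhausted.
    distractors = [t for t in all_tables if t not in gold_tables]
    random.shuffle(distractors)
    for table in distractors:
        cost = count_tokens(schema_and_content_for(table))
        if used + cost > TOKEN_BUDGET:
            break
        selected.append(table)
        used += cost

    # Shuffle so the relevant tables do not always appear in the same position.
    random.shuffle(selected)
    return "\n\n".join([INSTRUCTION] + [schema_and_content_for(t) for t in selected])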
We will now put all of this together to build our system_prompts. For the sample below from Spider:

Question: "How many heads of the departments are older than 56 ?"
SQL: "SELECT count(*) FROM head WHERE age > 56"

the system_prompt will be:

You are an assistant that is an expert in generating Sqlite SQL queries. Having the access to database content, generate a correct Sqlite SQL query for the given question. ### Database content ###

CREATE TABLE trip (
    id INTEGER,
    duration INTEGER,
    start_date TEXT,
    start_station_name TEXT,
    start_station_id INTEGER,
    end_date TEXT,
    end_station_name TEXT,
    end_station_id INTEGER,
    bike_id INTEGER,
    subscription_type TEXT,
    zip_code INTEGER,
    PRIMARY KEY (id)
)

/*
Columns in trip and 3 examples in each column for high cardinality columns :
id : 900645, 900752, 900524
duration : 1131, 2146, 1155
start_date : 8/21/2015 17:39, 8/21/2015 17:03, 8/21/2015 17:16
start_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th
start_station_id : 56, 65, 49
end_date : 8/21/2015 17:19, 8/21/2015 18:08, 8/21/2015 17:32
end_station_name : Howard at 2nd, 2nd at Folsom, Market at 10th
end_station_id : 56, 65, 49
bike_id : 586, 56, 65
zip_code : 94070, 94530, 94040-1724
*/

/*
Columns in trip and all categories for low cardinality columns :
subscription_type : Customer, Subscriber
*/

CREATE TABLE management (
    "department_ID" INTEGER,
    "head_ID" INTEGER,
    temporary_acting TEXT,
    PRIMARY KEY ("department_ID", "head_ID"),
    FOREIGN KEY("head_ID") REFERENCES head ("head_ID"),
    FOREIGN KEY("department_ID") REFERENCES department ("Department_ID")
)

/*
Columns in management and all categories for low cardinality columns :
department_ID : 7, 15, 2, 11
head_ID : 5, 4, 6, 3, 10
temporary_acting : Yes, No
*/

CREATE TABLE department (
    "Department_ID" INTEGER,
    "Name" TEXT,
    "Creation" TEXT,
    "Ranking" INTEGER,
    "Budget_in_Billions" REAL,
    "Num_Employees" REAL,
    PRIMARY KEY ("Department_ID")
)

/*
Columns in department and 3 examples in each column for high cardinality columns :
Department_ID : 1, 13, 11
Name : Energy, Interior, Health and Human Services
Creation : 1913, 1979, 1989
Ranking : 1, 13, 11
Budget_in_Billions : 10.7, 77.6, 59.7
Num_Employees : 112557.0, 3000000.0, 235000.0
*/

...

CREATE TABLE head (
    "head_ID" INTEGER,
    name TEXT,
    born_state TEXT,
    age REAL,
    PRIMARY KEY ("head_ID")
)

/*
Columns in head and all categories for low cardinality columns :
head_ID : 1, 2, 5, 7, 8, 4, 6, 3, 10, 9
name : Jeff Maggert, Padraig Harrington, Billy Mayfair, K. J. Choi, Dudley Hart, Sergio Garcia, Stewart Cink, Tiger Woods, Nick Faldo, Franklin Langham
born_state : Delaware, Connecticut, Alabama, California, Florida
age : 69.0, 67.0, 68.0, 53.0, 56.0, 52.0, 50.0, 43.0
*/

...

The user prompt

The user prompt is simply the user question for each sample in Spider. For example:

How many heads of the departments are older than 56 ?

The assistant prompt

The assistant prompt is also simple, containing the associated SQL query from Spider plus a reasoning step that identifies the correct tables and columns for the query. To construct the reasoning step we simply extracted the tables and columns used in the SQL query. For example:

To construct the query, I'll be working with the following tables: head.
From these tables, I'll be using the following columns: age.
The SQL query I'll be generating is:
SELECT count(*) FROM head WHERE age > 56
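The post does not show how the tables and columns were extracted from the gold query. One way to sketch it, using the sqlglot SQL parser (our choice of tool, not necessarily the authors'), is:

import sqlglot
from sqlglot import exp

def build_assistant_prompt(sql: str) -> str:
    # Parse the gold query and pull out the referenced tables and columns.
    parsed = sqlglot.parse_one(sql, read="sqlite")
    tables = sorted({t.name for t in parsed.find_all(exp.Table)})
    columns = sorted({c.name for c in parsed.find_all(exp.Column)})
    return (
        f"To construct the query, I'll be working with the following tables: {', '.join(tables)}.\n"
        f"From these tables, I'll be using the following columns: {', '.join(columns)}.\n"
        f"The SQL query I'll be generating is:\n{sql}"
    )

print(build_assistant_prompt("SELECT count(*) FROM head WHERE age > 56"))
# To construct the query, I'll be working with the following tables: head.
# From these tables, I'll be using the following columns: age.
# The SQL query I'll be generating is:
# SELECT count(*) FROM head WHERE age > 56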
Submitting the training set for fine-tuning

Once we have created the JSONL file (you can find a small sample here), the next step is to upload the file to OpenAI using the following command:

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")
print(openai.File.create(file=open("spider-finetuning.jsonl", "rb"), purpose="fine-tune"))

After uploading the file you can check the status of the upload using the following command:

print(openai.File.retrieve(id="file-id"))
# OR
print(openai.File.list())

The result should be something like this:

{
  "object": "file",
  "id": "file-id",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 71699079,
  "created_at": 1693343752,
  "status": "uploaded",
  "status_details": null
}

When the status has changed to processed (similar to below), you can use the file for fine-tuning:

{
  "object": "file",
  "id": "file-id",
  "purpose": "fine-tune",
  "filename": "file",
  "bytes": 71699079,
  "created_at": 1693343752,
  "status": "processed",
  "status_details": null
}

Now we are ready to start the fine-tuning job. To create a fine-tuning job you can use the following Python code:

print(openai.FineTuningJob.create(
    training_file="file-id",
    model="gpt-3.5-turbo",
    suffix="spider",
    hyperparameters={
        "n_epochs": 2,  # number of epochs
    },
))

The duration of the fine-tuning process will vary depending on the size of the fine-tuning dataset. There is also a maximum token limit for fine-tuning, which is set at 50,000,000 tokens. Therefore, when working with the Spider dataset, we reduced the number of samples from 7,000 to 5,750 and conducted fine-tuning for a total of 2 epochs.

You can check the status of the fine-tuning job using the following command:

print(openai.FineTuningJob.retrieve(id="ftjob-id"))

The result should be something like this:

{
  "object": "fine_tuning.job",
  "id": "ftjob-id",
  "model": "gpt-3.5-turbo-0613",
  "created_at": 1693346245,
  "finished_at": 1693353313,
  "fine_tuned_model": "ft:gpt-3.5-turbo-0613:dataherald:spider:id",
  "organization_id": "org-id",
  "result_files": [
    "file-id"
  ],
  "status": "succeeded",
  "validation_file": null,
  "training_file": "file-id",
  "hyperparameters": {
    "n_epochs": 2
  },
  "trained_tokens": 44722020
}
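The post does not show the inference side, but once the job succeeds, the fine_tuned_model name from the job result can be used like any other chat model. Below is a minimal sketch using the same (pre-1.0) openai Python SDK as the snippets above; generate_sql is a hypothetical helper name.

import os
import openai

openai.api_key = os.getenv("OPENAI_API_KEY")

def generate_sql(system_prompt: str, question: str, model: str) -> str:
    # Query the fine-tuned model with the same system/user structure used for training.
    response = openai.ChatCompletion.create(
        model=model,  # e.g. "ft:gpt-3.5-turbo-0613:dataherald:spider:id" from the job result
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
        temperature=0,  # deterministic output for benchmarking
    )
    return response["choices"][0]["message"]["content"]

Since the assistant messages used for training contain the reasoning step followed by the SQL, the model's reply should include both; for benchmarking, the query itself can be taken from the text following "The SQL query I'll be generating is:".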
Model Performance

We benchmarked the zero-shot performance of the fine-tuned model against GPT-3.5-Turbo without fine-tuning and against DIN-SQL + GPT-4 (the current state of the art on Spider). The results are as follows:

Performance of the fine-tuned GPT-3.5-Turbo against previous methods.

Fine-tuning GPT-3.5-Turbo yielded a performance improvement of nearly 11 percent, bringing its accuracy in line with DIN-SQL + GPT-4, the current state-of-the-art approach, which uses GPT-4 and employs various advanced prompting techniques, including few-shot prompting, chain-of-thought prompting and decomposed prompting. Critically, the fine-tuned model significantly reduces both cost and processing time compared to the DIN-SQL + GPT-4 approach. The table below gives the approximate cost and speed of each model per single question from Spider.

Cost and speed of different models per question from the Spider benchmark.

As demonstrated above, the fine-tuned GPT-3.5-Turbo model is roughly 30 times cheaper than DIN-SQL with GPT-4 and about 12 times faster.

Conclusion and Next Steps

The results from the experiment are clear: with an initial investment of time and money to build a training dataset, the state of the art can be matched in accuracy while being 12 times faster and 30 times cheaper.

Fine-tuning is a powerful tool in the NL-to-SQL arsenal. However, it is not a silver bullet, as few organizations have NL-to-SQL training datasets readily available. It is our belief that the best architectures will combine fine-tuned models with RAG agents. With the anticipated launch of GPT-4 fine-tuning, we expect progress in the field to accelerate further and finally unlock question answering from structured data for all businesses.

In the next post we will show how to plug the fine-tuned model above into the Dataherald engine and deploy it in a real-world scenario. If you are interested in NL-to-SQL discussions you can join our Discord server. If you want to allow non-technical users to ask questions of your company's data warehouse, please join our waitlist.

References

DIN-SQL paper: https://arxiv.org/abs/2304.11015

NL-to-SQL useful papers:
How to Prompt LLMs for Text-to-SQL: https://arxiv.org/abs/2305.11853
Divide and Prompt: https://arxiv.org/abs/2304.11556
Exploring Chain-of-Thought Style Prompting for Text-to-SQL: https://arxiv.org/abs/2305.14215
A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability: https://arxiv.org/abs/2303.13547