https://gvwilson.github.io/sql-tutorial/ SQL for Data Scientists in 100 Queries what this is * notes and working examples that instructors can use to perform a lesson + do not expect novices with no prior SQL experience to be able to learn from them * musical analogy + this is the chord changes and melody + we expect instructors to create an arrangement and/or improvise while delivering + see Teaching Tech Together for background scope * intended audience + Rachel has a master's degree in cell biology and now works in a research hospital doing cell assays. + She learned a bit of R in an undergrad biostatistics course and has been through the Carpentries lesson on the Unix shell. + Rachel is thinking about becoming a data scientist and would like to understand how data is stored and managed. + Her work schedule is unpredictable and highly variable, so she needs to be able to learn a bit at a time. * prerequisites + basic Unix command line: cd, ls, * wildcard + basic tabular data analysis: filtering rows, aggregating within groups * learning outcomes 1. Explain the difference between a database and a database manager. 2. Write SQL to select, filter, sort, group, and aggregate data. 3. Define tables and insert, update, and delete records. 4. Describe different types of join and write queries that use them to combine data. 5. Use windowing functions to operate on adjacent rows. 6. Explain what transactions are and write queries that roll back when constraints are violated. 7. Explain what triggers are and write SQL to create them. 8. Manipulate JSON data using SQL. 9. Interact with a database using Python directly, from a Jupyter notebook, and via an ORM. setup * Download the latest release * Unzip the file in a temporary directory to create: + ./db/*.db: the SQLite databases used in the examples + ./src/*.*: SQL queries, Python scripts, and other source code + ./out/*.*: expected output for examples background concepts * A database is a collection of data that can be searched and retrieved * A database management system (DBMS) is a program that manages a particular kind of database * Each DBMS stores data in its own way + SQLite stores each database in a single file + PostgreSQL spreads information across many files for higher performance * A DBMS can be a library embedded in other programs (SQLite) or a server (PostgreSQL) * A relational database management system (RDBMS) stores data in tables and uses SQL for queries + Unfortunately, every RDBMS has its own dialect of SQL * There are also NoSQL databases like MongoDB that don't use tables concept map: overview connect to database src/connect_penguins.sh sqlite3 db/penguins.db * Not actually a query * But we have to do it before we can do anything else 1: select constant src/select_1.sql select 1; out/select_1.out 1 * select is a keyword * Normally used to select data from a table...
* ...but if all we want is a constant value, we don't need to specify one * Semi-colon terminator is required 2: select all values from table src/select_star.sql select * from little_penguins; out/select_star.out Adelie|Dream|37.2|18.1|178|3900|MALE Adelie|Dream|37.6|19.3|181|3300|FEMALE Gentoo|Biscoe|50|15.3|220|5550|MALE Adelie|Torgersen|37.3|20.5|199|3775|MALE Adelie|Biscoe|39.6|17.7|186|3500|FEMALE Gentoo|Biscoe|47.7|15|216|4750|FEMALE Adelie|Dream|36.5|18|182|3150|FEMALE Gentoo|Biscoe|42|13.5|210|4150|FEMALE Adelie|Torgersen|42.1|19.1|195|4000|MALE Gentoo|Biscoe|54.3|15.7|231|5650|MALE * An actual query * Use * to mean "all columns" * Use from tablename to specify table * Output format is not particularly readable administrative commands src/admin_commands.sql .headers on .mode markdown select * from little_penguins; out/admin_commands.out | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |---------|-----------|----------------|---------------|-------------------|-------------|--------| | Adelie | Dream | 37.2 | 18.1 | 178 | 3900 | MALE | | Adelie | Dream | 37.6 | 19.3 | 181 | 3300 | FEMALE | | Gentoo | Biscoe | 50 | 15.3 | 220 | 5550 | MALE | | Adelie | Torgersen | 37.3 | 20.5 | 199 | 3775 | MALE | | Adelie | Biscoe | 39.6 | 17.7 | 186 | 3500 | FEMALE | | Gentoo | Biscoe | 47.7 | 15 | 216 | 4750 | FEMALE | | Adelie | Dream | 36.5 | 18 | 182 | 3150 | FEMALE | | Gentoo | Biscoe | 42 | 13.5 | 210 | 4150 | FEMALE | | Adelie | Torgersen | 42.1 | 19.1 | 195 | 4000 | MALE | | Gentoo | Biscoe | 54.3 | 15.7 | 231 | 5650 | MALE | * SQLite administrative commands start with . and aren't part of the SQL standard + PostgreSQL's special commands start with \ * Use .help for a complete list 3: specify columns src/specify_columns.sql select species, island, sex from little_penguins; out/specify_columns.out | species | island | sex | |---------|-----------|--------| | Adelie | Dream | MALE | | Adelie | Dream | FEMALE | | Gentoo | Biscoe | MALE | | Adelie | Torgersen | MALE | | Adelie | Biscoe | FEMALE | | Gentoo | Biscoe | FEMALE | | Adelie | Dream | FEMALE | | Gentoo | Biscoe | FEMALE | | Adelie | Torgersen | MALE | | Gentoo | Biscoe | MALE | * Specify column names separated by commas + In any order + Duplicates allowed * Line breaks allowed and encouraged for readability 4: sort src/sort.sql select species, sex, island from little_penguins order by island asc, sex desc; out/sort.out | species | sex | island | |---------|--------|-----------| | Gentoo | MALE | Biscoe | | Gentoo | MALE | Biscoe | | Adelie | FEMALE | Biscoe | | Gentoo | FEMALE | Biscoe | | Gentoo | FEMALE | Biscoe | | Adelie | MALE | Dream | | Adelie | FEMALE | Dream | | Adelie | FEMALE | Dream | | Adelie | MALE | Torgersen | | Adelie | MALE | Torgersen | * order by must follow from (which must follow select) * asc is ascending, desc is descending + Default is ascending, but please specify 5: limit output * Full dataset has 344 rows src/limit.sql select species, sex, island from penguins order by species, sex, island limit 10; out/limit.out | species | sex | island | |---------|--------|-----------| | Adelie | | Dream | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | * Comments start with -- and run to the end of the line * limit N specifies maximum number of rows returned by query 6: page output
src/page.sql select species, sex, island from penguins order by species, sex, island limit 10 offset 3; out/page.out | species | sex | island | |---------|--------|-----------| | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | * offset N must follow limit * Specifies number of rows to skip from the start of the selection * So this query skips the first 3 and shows the next 10 7: remove duplicates src/distinct.sql select distinct species, sex, island from penguins; out/distinct.out | species | sex | island | |-----------|--------|-----------| | Adelie | MALE | Torgersen | | Adelie | FEMALE | Torgersen | | Adelie | | Torgersen | | Adelie | FEMALE | Biscoe | | Adelie | MALE | Biscoe | | Adelie | FEMALE | Dream | | Adelie | MALE | Dream | | Adelie | | Dream | | Chinstrap | FEMALE | Dream | | Chinstrap | MALE | Dream | | Gentoo | FEMALE | Biscoe | | Gentoo | MALE | Biscoe | | Gentoo | | Biscoe | * distinct keyword must appear right after select + SQL was supposed to read like English * Shows distinct combinations * Blanks in sex column show missing data + We'll talk about this in a bit 8: filter results src/filter.sql select distinct species, sex, island from penguins where island = 'Biscoe'; out/filter.out | species | sex | island | |---------|--------|--------| | Adelie | FEMALE | Biscoe | | Adelie | MALE | Biscoe | | Gentoo | FEMALE | Biscoe | | Gentoo | MALE | Biscoe | | Gentoo | | Biscoe | * where condition filters the rows produced by selection * Condition is evaluated independently for each row * Only rows that pass the test appear in results * Use single quotes for 'text data' and double quotes for "weird column names" + SQLite will accept double-quoted text data 9: filter with more complex conditions src/filter_and.sql select distinct species, sex, island from penguins where island = 'Biscoe' and sex != 'MALE'; out/filter_and.out | species | sex | island | |---------|--------|--------| | Adelie | FEMALE | Biscoe | | Gentoo | FEMALE | Biscoe | * and: both sub-conditions must be true * or: either or both parts must be true * Notice that the row for Gentoo penguins on Biscoe island with unknown (empty) sex didn't pass the test + We'll talk about this in a bit 10: do calculations src/calculations.sql select flipper_length_mm / 10.0, body_mass_g / 1000.0 from penguins limit 3; out/calculations.out | flipper_length_mm / 10.0 | body_mass_g / 1000.0 | |--------------------------|----------------------| | 18.1 | 3.75 | | 18.6 | 3.8 | | 19.5 | 3.25 | * Can do the usual kinds of arithmetic on individual values + Calculation done for each row independently * Column name shows the calculation done 11: rename columns src/rename_columns.sql select flipper_length_mm / 10.0 as flipper_cm, body_mass_g / 1000.0 as weight_kg, island as where_found from penguins limit 3; out/rename_columns.out | flipper_cm | weight_kg | where_found | |------------|-----------|-------------| | 18.1 | 3.75 | Torgersen | | 18.6 | 3.8 | Torgersen | | 19.5 | 3.25 | Torgersen | * Use expression as name to rename * Give result of calculation a meaningful name * Can also rename columns without modifying them check your understanding concept map: selection 12: calculate with missing values src/show_missing_values.sql select flipper_length_mm / 10.0 as flipper_cm, body_mass_g / 1000.0 as weight_kg, island as
where_found from penguins limit 5; out/show_missing_values.out | flipper_cm | weight_kg | where_found | |------------|-----------|-------------| | 18.1 | 3.75 | Torgersen | | 18.6 | 3.8 | Torgersen | | 19.5 | 3.25 | Torgersen | | | | Torgersen | | 19.3 | 3.45 | Torgersen | * SQL uses a special value null to represent missing data + Not 0 or empty string, but "I don't know" * Flipper length and body weight not known for one of the first five penguins * "I don't know" divided by 10 or 1000 is "I don't know" 13: null equality * Repeated from above so it doesn't count against our query limit src/filter.sql select distinct species, sex, island from penguins where island = 'Biscoe'; out/filter.out | species | sex | island | |---------|--------|--------| | Adelie | FEMALE | Biscoe | | Adelie | MALE | Biscoe | | Gentoo | FEMALE | Biscoe | | Gentoo | MALE | Biscoe | | Gentoo | | Biscoe | * If we ask for female penguins the row with the missing sex drops out src/null_equality.sql select distinct species, sex, island from penguins where island = 'Biscoe' and sex = 'FEMALE'; out/null_equality.out | species | sex | island | |---------|--------|--------| | Adelie | FEMALE | Biscoe | | Gentoo | FEMALE | Biscoe | 14: null inequality * But if we ask for penguins that aren't female it drops out as well src/null_inequality.sql select distinct species, sex, island from penguins where island = 'Biscoe' and sex != 'FEMALE'; out/null_inequality.out | species | sex | island | |---------|------|--------| | Adelie | MALE | Biscoe | | Gentoo | MALE | Biscoe | 15: ternary logic src/ternary_logic.sql select null = null; out/ternary_logic.out | null = null | |-------------| | | * If we don't know the left and right values, we don't know if they're equal or not * So the result is null * Get the same answer for null != null * Truth table for ternary equality: | equality | X | Y | null | |----------|-------|-------|------| | X | true | false | null | | Y | false | true | null | | null | null | null | null | 16: handle null safely src/safe_null_equality.sql select species, sex, island from penguins where sex is null; out/safe_null_equality.out | species | sex | island | |---------|-----|-----------| | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Dream | | Gentoo | | Biscoe | | Gentoo | | Biscoe | | Gentoo | | Biscoe | | Gentoo | | Biscoe | | Gentoo | | Biscoe | * Use is null and is not null to handle null safely * Other parts of SQL handle nulls specially check your understanding concept map: null 17: aggregate src/simple_sum.sql select sum(body_mass_g) as total_mass from penguins; out/simple_sum.out | total_mass | |------------| | 1437000 | * Aggregation combines many values to produce one * sum is an aggregation function * Combines corresponding values from multiple rows 18: common aggregation functions src/common_aggregations.sql select max(bill_length_mm) as longest_bill, min(flipper_length_mm) as shortest_flipper, avg(bill_length_mm) / avg(bill_depth_mm) as weird_ratio from penguins; out/common_aggregations.out | longest_bill | shortest_flipper | weird_ratio | |--------------|------------------|------------------| | 59.6 | 172 | 2.56087082530644 | * This actually shouldn't work: can't calculate maximum or average if any values are null * SQL does the useful thing instead of the right one 19: counting src/count_behavior.sql select count(*) as count_star, count(sex) as count_specific from penguins; out/count_behavior.out | count_star | count_specific | |------------|----------------| | 344 | 333 | * count(*) counts rows *
count(column) counts non-null entries in column 20: group src/simple_group.sql select avg(body_mass_g) as average_mass_g from penguins group by sex; out/simple_group.out | average_mass_g | |------------------| | 4005.55555555556 | | 3862.27272727273 | | 4545.68452380952 | * Put rows in groups based on distinct combinations of values in columns specified with group by * Then perform aggregation separately for each group * But which is which? 21: behavior of unaggregated columns src/unaggregated_columns.sql select sex, avg(body_mass_g) as average_mass_g from penguins group by sex; out/unaggregated_columns.out | sex | average_mass_g | |--------|------------------| | | 4005.55555555556 | | FEMALE | 3862.27272727273 | | MALE | 4545.68452380952 | * All rows in each group have the same value for sex, so no need to aggregate 22: arbitrary choice in aggregation src/arbitrary_in_aggregation.sql select sex, body_mass_g from penguins group by sex; out/arbitrary_in_aggregation.out | sex | body_mass_g | |--------|-------------| | | | | FEMALE | 3800 | | MALE | 3750 | * If we don't specify how to aggregate a column, SQL can choose any arbitrary value from the group * All penguins in each group have the same sex because we grouped by that, so we get the right answer * The body mass values are in the data but unpredictable * A common mistake 23: filter aggregated values src/filter_aggregation.sql select sex, avg(body_mass_g) as average_mass_g from penguins group by sex having average_mass_g > 4000.0; out/filter_aggregation.out | sex | average_mass_g | |------|------------------| | | 4005.55555555556 | | MALE | 4545.68452380952 | * Using having condition instead of where condition for aggregates 24: readable output src/readable_aggregation.sql select sex, round(avg(body_mass_g), 1) as average_mass_g from penguins group by sex having average_mass_g > 4000.0; out/readable_aggregation.out | sex | average_mass_g | |------|----------------| | | 4005.6 | | MALE | 4545.7 | * Use round(value, decimals) to round off a number 25: filter aggregate inputs src/filter_aggregate_inputs.sql select sex, round( avg(body_mass_g) filter (where body_mass_g < 4000.0), 1 ) as average_mass_g from penguins group by sex; out/filter_aggregate_inputs.out | sex | average_mass_g | |--------|----------------| | | 3362.5 | | FEMALE | 3417.3 | | MALE | 3752.5 | * filter (where condition) applies to inputs check your understanding concept map: aggregation create in-memory database src/in_memory_db.sh sqlite3 :memory: * "Connect" to an in-memory database 26: create tables src/create_work_job.sql create table job ( name text not null, billable real not null ); create table work ( person text not null, job text not null ); * create table name followed by parenthesized list of columns * Each column is a name, a data type, and optional extra information + E.g., not null prevents nulls from being added * .schema is not standard SQL * SQLite has added a few things + create if not exists + upper-case keywords (SQL is case insensitive) 27: insert data src/populate_work_job.sql insert into job values ('calibrate', 1.5), ('clean', 0.5); insert into work values ('mik', 'calibrate'), ('mik', 'clean'), ('mik', 'complain'), ('po', 'clean'), ('po', 'complain'), ('tay', 'complain'); out/insert_values.out | name | billable | |-----------|----------| | calibrate | 1.5 | | clean | 0.5 | | person | job | |--------|-----------| | mik | calibrate | | mik | clean | | mik | complain | | po | clean | | po | complain | | tay | complain | 28: update rows 
src/update_work_job.sql update work set person = "tae" where person = "tay"; out/update_rows.out | person | job | |--------|-----------| | mik | calibrate | | mik | clean | | mik | complain | | po | clean | | po | complain | | tae | complain | * (Almost) always specify row(s) to update using where + Would otherwise update all rows * Useful to give each row a primary key that uniquely identifies it for this purpose + Will see other uses below 29: delete rows src/delete_rows.sql delete from work where person = "tae"; select * from work; out/delete_rows.out | person | job | |--------|-----------| | mik | calibrate | | mik | clean | | mik | complain | | po | clean | | po | complain | * Again, (almost) always specify row(s) to delete using where 30: backing up src/backing_up.sql create table backup ( person text not null, job text not null ); insert into backup select person, job from work where person = 'tae'; delete from work where person = 'tae'; select * from backup; out/backing_up.out | person | job | |--------|----------| | tae | complain | check your understanding concept map: data definition and modification 31: join tables src/cross_join.sql select * from work cross join job; out/cross_join.out | person | job | name | billable | |--------|-----------|-----------|----------| | mik | calibrate | calibrate | 1.5 | | mik | calibrate | clean | 0.5 | | mik | clean | calibrate | 1.5 | | mik | clean | clean | 0.5 | | mik | complain | calibrate | 1.5 | | mik | complain | clean | 0.5 | | po | clean | calibrate | 1.5 | | po | clean | clean | 0.5 | | po | complain | calibrate | 1.5 | | po | complain | clean | 0.5 | | tay | complain | calibrate | 1.5 | | tay | complain | clean | 0.5 | * A join combines information from two tables * cross join constructs their cross product + All combinations of rows from each * Result isn't particularly useful: job and name don't match 32: inner join src/inner_join.sql select * from work inner join job on work.job = job.name; out/inner_join.out | person | job | name | billable | |--------|-----------|-----------|----------| | mik | calibrate | calibrate | 1.5 | | mik | clean | clean | 0.5 | | po | clean | clean | 0.5 | * Use table.column notation to specify columns + A column can have the same name as a table * Use on condition to specify join condition * Since complain doesn't appear in job.name, none of those rows are kept 33: aggregate joined data src/aggregate_join.sql select work.person, sum(job.billable) as pay from work inner join job on work.job = job.name group by work.person; out/aggregate_join.out | person | pay | |--------|-----| | mik | 2.0 | | po | 0.5 | * Combines ideas we've seen before * But Tay is missing from the table 34: left join src/left_join.sql select * from work left join job on work.job = job.name; out/left_join.out | person | job | name | billable | |--------|-----------|-----------|----------| | mik | calibrate | calibrate | 1.5 | | mik | clean | clean | 0.5 | | mik | complain | | | | po | clean | clean | 0.5 | | po | complain | | | | tay | complain | | | * A left outer join keeps all rows from the left table * Fills missing values from right table with null 35: aggregate left joins src/aggregate_left_join.sql select work.person, sum(job.billable) as pay from work left join job on work.job = job.name group by work.person; out/aggregate_left_join.out | person | pay | |--------|-----| | mik | 2.0 | | po | 0.5 | | tay | | * That's better, but we'd like to see 0 rather than a blank check your understanding
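As an aside, left join has mirror images. SQLite 3.39 and later also accept right and full outer joins; earlier releases reject the syntax, so treat this as a version-dependent sketch rather than part of the lesson's canonical examples:

```sql
-- right join keeps all rows from the right table (job) instead of work;
-- requires SQLite 3.39 or later.
select *
from work right join job on work.job = job.name;

-- full join keeps unmatched rows from *both* tables, padding with null.
select *
from work full join job on work.job = job.name;
```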
concept map: join 36: coalesce values src/coalesce.sql select work.person, coalesce(sum(job.billable), 0.0) as pay from work left join job on work.job = job.name group by work.person; out/coalesce.out | person | pay | |--------|-----| | mik | 2.0 | | po | 0.5 | | tay | 0.0 | * coalesce(val1, val2, ...) returns first non-null value 37: negate incorrectly * Who doesn't calibrate? src/negate_incorrectly.sql select distinct person from work where job != 'calibrate'; out/negate_incorrectly.out | person | |--------| | mik | | po | | tay | * But Mik does calibrate * Problem is that there's an entry for Mik cleaning * And since 'clean' != 'calibrate', that row is included in the results * We need a different approach 38: set membership src/set_membership.sql select * from work where person not in ('mik', 'tay'); out/set_membership.out | person | job | |--------|----------| | po | clean | | po | complain | * in values and not in values do exactly what you expect 39: subqueries src/subquery_set.sql select distinct person from work where person not in ( select distinct person from work where job = 'calibrate' ); out/subquery_set.out | person | |--------| | po | | tay | * Use a subquery to select the people who do calibrate * Then select all the people who aren't in that set * Initially feels odd, but subqueries are useful in other ways M to N relationships * Relationships between entities are usually characterized as: + 1-to-1: fields in the same record + 1-to-many: the many have a foreign key referring to the one's primary key + many-to-many: don't know how many keys to add to records ("maximum" never is) * Nearly-universal solution is a join table + Each record is a pair of foreign keys + I.e., each record is the fact that records A and B are related 40: autoincrement and primary key src/autoincrement.sql create table person ( ident integer primary key autoincrement, name text not null ); insert into person values (null, 'mik'), (null, 'po'), (null, 'tay'); select * from person; insert into person values (1, "prevented"); out/autoincrement.out | ident | name | |-------|------| | 1 | mik | | 2 | po | | 3 | tay | Runtime error near line 12: UNIQUE constraint failed: person.ident (19) * Database autoincrements ident each time a new record is added * Use that field as the primary key + So that if Mik changes their name again, we only have to change one fact in the database + Downside: manual queries are harder to read (who is person 17?) 
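A common alternative to supplying null for the autoincremented column is to name the inserted columns explicitly. A small sketch (the name 'asha' is hypothetical and this is not run against the lesson database, so the outputs below are unaffected):

```sql
-- Naming the columns lets the database fill in ident automatically.
insert into person(name) values ('asha');
-- The new row gets the next sequence value (4, continuing from tay's 3).
select * from person where name = 'asha';
```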
internal tables src/sequence_table.sql select * from sqlite_sequence; out/sequence_table.out | name | seq | |--------|-----| | person | 3 | * Sequence numbers are not reset when rows are deleted 41: alter tables src/alter_tables.sql alter table job add ident integer not null default -1; update job set ident = 1 where name = 'calibrate'; update job set ident = 2 where name = 'clean'; select * from job; out/alter_tables.out | name | billable | ident | |-----------|----------|-------| | calibrate | 1.5 | 1 | | clean | 0.5 | 2 | * Add a column after the fact * Since it can't be null, we have to provide a default value + Really want to make it the primary key, but SQLite doesn't allow that (easily) after the fact * Then use update to modify existing records + Can modify any number of records at once + So be careful about the where clause * Changes like this are called data migration 42: create new tables from old src/insert_select.sql create table new_work ( person_id integer not null, job_id integer not null, foreign key(person_id) references person(ident), foreign key(job_id) references job(ident) ); insert into new_work select person.ident as person_id, job.ident as job_id from (person join work on person.name = work.person) join job on job.name = work.job; select * from new_work; out/insert_select.out | person_id | job_id | |-----------|--------| | 1 | 1 | | 1 | 2 | | 2 | 2 | * new_work is our join table * Each column refers to a record in some other table 43: remove tables src/drop_table.sql drop table work; alter table new_work rename to work; out/drop_table.out CREATE TABLE job ( ident integer primary key autoincrement, name text not null, billable real not null ); CREATE TABLE sqlite_sequence(name,seq); CREATE TABLE person ( ident integer primary key autoincrement, name text not null ); CREATE TABLE IF NOT EXISTS "work" ( person_id integer not null, job_id integer not null, foreign key(person_id) references person(ident), foreign key(job_id) references job(ident) ); * Remove the old table and rename the new one to take its place + Note the if not exists in the dumped schema * Be careful...
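The if exists / if not exists modifiers seen in that schema dump make scripts like this safe to re-run. A minimal sketch of the idiom (illustrative only, not part of the migration above):

```sql
-- No error even if the table is already gone, so the script is idempotent.
drop table if exists scratch;

-- Likewise, creation can be guarded so re-running the script is harmless.
create table if not exists scratch (
    value integer not null
);
```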
44: compare individual values to aggregates * Go back to penguins src/compare_individual_aggregate.sql select body_mass_g from penguins where body_mass_g > ( select avg(body_mass_g) from penguins ) limit 5; out/compare_individual_aggregate.out | body_mass_g | |-------------| | 4675 | | 4250 | | 4400 | | 4500 | | 4650 | * Get average body mass in subquery * Compare each row against that * Requires two scans of the data, but there's no way to avoid that * Null values aren't included in the average or in the final results 45: compare individual values to aggregates within groups src/compare_within_groups.sql select penguins.species, penguins.body_mass_g, round(averaged.avg_mass_g, 1) as avg_mass_g from penguins join ( select species, avg(body_mass_g) as avg_mass_g from penguins group by species ) as averaged on penguins.species = averaged.species where penguins.body_mass_g > averaged.avg_mass_g limit 5; out/compare_within_groups.out | species | body_mass_g | avg_mass_g | |---------|-------------|------------| | Adelie | 3750 | 3700.7 | | Adelie | 3800 | 3700.7 | | Adelie | 4675 | 3700.7 | | Adelie | 4250 | 3700.7 | | Adelie | 3800 | 3700.7 | * Subquery runs first to create temporary table averaged with average mass per species * Join that with penguins * Filter to find penguins heavier than average within their species 46: common table expressions src/common_table_expressions.sql with grouped as ( select species, avg(body_mass_g) as avg_mass_g from penguins group by species ) select penguins.species, penguins.body_mass_g, round(grouped.avg_mass_g, 1) as avg_mass_g from penguins join grouped on penguins.species = grouped.species where penguins.body_mass_g > grouped.avg_mass_g limit 5; out/common_table_expressions.out | species | body_mass_g | avg_mass_g | |---------|-------------|------------| | Adelie | 3750 | 3700.7 | | Adelie | 3800 | 3700.7 | | Adelie | 4675 | 3700.7 | | Adelie | 4250 | 3700.7 | | Adelie | 3800 | 3700.7 | * Use common table expression (CTE) to make queries clearer + Nested subqueries quickly become difficult to understand * Database decides how to optimize explain query plan src/explain_query_plan.sql explain query plan select species, avg(body_mass_g) from penguins group by species; out/explain_query_plan.out QUERY PLAN |--SCAN penguins `--USE TEMP B-TREE FOR GROUP BY * SQLite plans to scan every row of the table * It will build a temporary B-tree data structure to group rows 47: enumerate rows * Every table has a special column called rowid src/rowid.sql select rowid, species, island from penguins limit 5; out/rowid.out | rowid | species | island | |-------|---------|-----------| | 1 | Adelie | Torgersen | | 2 | Adelie | Torgersen | | 3 | Adelie | Torgersen | | 4 | Adelie | Torgersen | | 5 | Adelie | Torgersen | * rowid is persistent within a session + I.e., if we delete the first 5 rows we now have row IDs 6...N * Do not rely on row ID + In particular, do not use it as a key 48: if-else function src/if_else.sql with sized_penguins as ( select species, iif( body_mass_g < 3500, 'small', 'large' ) as size from penguins ) select species, size, count(*) as num from sized_penguins group by species, size order by species, num; out/if_else.out | species | size | num | |-----------|-------|-----| | Adelie | small | 54 | | Adelie | large | 98 | | Chinstrap | small | 17 | | Chinstrap | large | 51 | | Gentoo | large | 124 | * iif(condition, true_result, false_result) + Note: iif with two i's 49: select a case * What if we want small, medium, and large?
* Can nest iif, but quickly becomes unreadable src/case_when.sql with sized_penguins as ( select species, case when body_mass_g < 3500 then 'small' when body_mass_g < 5000 then 'medium' else 'large' end as size from penguins ) select species, size, count(*) as num from sized_penguins group by species, size order by species, num; out/case_when.out | species | size | num | |-----------|--------|-----| | Adelie | large | 1 | | Adelie | small | 54 | | Adelie | medium | 97 | | Chinstrap | small | 17 | | Chinstrap | medium | 51 | | Gentoo | medium | 56 | | Gentoo | large | 68 | * Evaluate when options in order and take first * Result of case is null if no condition is true * Use else as fallback 50: check range src/check_range.sql with sized_penguins as ( select species, case when body_mass_g between 3500 and 5000 then 'normal' else 'abnormal' end as size from penguins ) select species, size, count(*) as num from sized_penguins group by species, size order by species, num; out/check_range.out | species | size | num | |-----------|----------|-----| | Adelie | abnormal | 55 | | Adelie | normal | 97 | | Chinstrap | abnormal | 17 | | Chinstrap | normal | 51 | | Gentoo | abnormal | 62 | | Gentoo | normal | 62 | * between can make queries easier to read * But be careful of the and in the middle yet another database * Entity-relationship diagram (ER diagram) shows relationships between tables * Like everything to do with databases, there are lots of variations assay database table diagram assay ER diagram src/assay_staff.sql select * from staff; out/assay_staff.out | ident | personal | family | dept | age | |-------|----------|-----------|------|-----| | 1 | Kartik | Gupta | | 46 | | 2 | Divit | Dhaliwal | hist | 34 | | 3 | Indrans | Sridhar | mb | 47 | | 4 | Pranay | Khanna | mb | 51 | | 5 | Riaan | Dua | | 23 | | 6 | Vedika | Rout | hist | 45 | | 7 | Abram | Chokshi | gen | 23 | | 8 | Romil | Kapoor | hist | 38 | | 9 | Ishaan | Ramaswamy | mb | 35 | | 10 | Nitya | Lal | gen | 52 | 51: pattern matching src/like_glob.sql select personal, family from staff where personal like '%ya%' or family glob '*De*'; out/like_glob.out | personal | family | |----------|--------| | Nitya | Lal | * like is the original SQL pattern matcher + % matches zero or more characters + Case insensitive by default * glob supports Unix-style wildcards | name | purpose | |------|---------| | substr | Get substring given starting point and length | | trim | Remove characters from beginning and end of string | | ltrim | Remove characters from beginning of string | | rtrim | Remove characters from end of string | | length | Length of string | | replace | Replace occurrences of substring with another string | | upper | Return upper-case version of string | | lower | Return lower-case version of string | | instr | Find location of first occurrence of substring (returns 0 if not found) | 52: select first and last rows src/union_all.sql select * from ( select * from (select * from experiment order by started asc limit 5) union all select * from (select * from experiment order by started desc limit 5) ) order by started asc ; out/union_all.out | ident | kind | started | ended | |-------|-------------|------------|------------| | 17 | trial | 2023-01-29 | 2023-01-30 | | 35 | calibration | 2023-01-30 | 2023-01-30 | | 36 | trial | 2023-02-02 | 2023-02-03 | | 25 | trial | 2023-02-12 | 2023-02-14 | | 2 | calibration | 2023-02-14 | 2023-02-14 | | 40 | calibration | 2024-01-21 | 2024-01-21 | | 12 | trial | 2024-01-26 | 2024-01-28 | | 44 | trial | 2024-01-27 | 2024-01-29 | | 34 | trial |
2024-02-01 | 2024-02-02 | | 14 | calibration | 2024-02-03 | 2024-02-03 | * union all combines records + Keeps duplicates: union on its own keeps unique records * Yes, it feels like the extra select * from should be unnecessary 53: intersection src/intersect.sql select personal, family, dept, age from staff where dept = 'mb' intersect select personal, family, dept, age from staff where age < 50 ; out/intersect.out | personal | family | dept | age | |----------|-----------|------|-----| | Indrans | Sridhar | mb | 47 | | Ishaan | Ramaswamy | mb | 35 | * Tables being intersected must have same structure * Intersection usually used when pulling values from different tables + In this case, would be clearer to use where 54: exclusion src/except.sql select personal, family, dept, age from staff where dept = 'mb' except select personal, family, dept, age from staff where age < 50 ; out/except.out | personal | family | dept | age | |----------|--------|------|-----| | Pranay | Khanna | mb | 51 | * Again, tables must have same structure + And this would be clearer with where * SQL operates on sets, not tables, except where it doesn't 55: random numbers and why not src/random_numbers.sql with decorated as ( select random() as rand, personal || ' ' || family as name from staff ) select rand, abs(rand) % 10 as selector, name from decorated where selector < 5; out/random_numbers.out | rand | selector | name | |----------------------|----------|-----------------| | 7176652035743196310 | 0 | Divit Dhaliwal | | -2243654635505630380 | 2 | Indrans Sridhar | | -6940074802089166303 | 5 | Pranay Khanna | | 8882650891091088193 | 9 | Riaan Dua | | -45079732302991538 | 5 | Vedika Rout | | -8973877087806386134 | 2 | Abram Chokshi | | 3360598450426870356 | 9 | Romil Kapoor | * There is no way to seed SQLite's random number generator * Which means there is no way to reproduce one of its "random" sequences 56: creating index src/create_use_index.sql explain query plan select filename from plate where filename like '%07%'; create index plate_file on plate(filename); explain query plan select filename from plate where filename like '%07%'; out/create_use_index.out QUERY PLAN `--SCAN plate USING COVERING INDEX sqlite_autoindex_plate_1 QUERY PLAN `--SCAN plate USING COVERING INDEX plate_file * An index is an auxiliary data structure that enables faster access to records + Spend storage space to buy speed * Don't have to mention it explicitly in queries + Database manager will use it automatically 57: generate sequence src/generate_sequence.sql select value from generate_series(1, 5); out/generate_sequence.out | value | |-------| | 1 | | 2 | | 3 | | 4 | | 5 | * A (non-standard) table-valued function 58: generate sequence based on data src/data_range_sequence.sql create table temp ( num integer not null ); insert into temp values (1), (5); select value from generate_series ( (select min(num) from temp), (select max(num) from temp) ); out/data_range_sequence.out | value | |-------| | 1 | | 2 | | 3 | | 4 | | 5 | * Must have the parentheses around the min and max selections to keep SQLite happy 59: generate sequence of dates src/date_sequence.sql select date((select julianday(min(started)) from experiment) + value) as some_day from ( select value from generate_series( (select 0), (select count(*) - 1 from experiment) ) ) limit 5; out/date_sequence.out | some_day | |------------| | 2023-01-29 | | 2023-01-30 | | 2023-01-31 | | 2023-02-01 | | 2023-02-02 | * SQLite represents dates as YYYY-MM-DD strings or as Julian days or as Unix 
milliseconds or... + Julian days is fractional number of days since November 24, 4714 BCE * julianday and date convert back and forth 60: count experiments started per day without gaps src/experiments_per_day.sql with -- complete sequence of days with 0 as placeholder for number of experiments all_days as ( select date((select julianday(min(started)) from experiment) + value) as some_day, 0 as zeroes from ( select value from generate_series( (select 0), (select count(*) - 1 from experiment) ) ) ), -- sequence of actual days with actual number of experiments started actual_days as ( select started, count(started) as num_exp from experiment group by started ) -- combined by joining on day and taking actual number (if available) or zero select all_days.some_day as day, coalesce(actual_days.num_exp, all_days.zeroes) as num_exp from all_days left join actual_days on all_days.some_day = actual_days.started limit 5 ; out/experiments_per_day.out | day | num_exp | |------------|---------| | 2023-01-29 | 1 | | 2023-01-30 | 1 | | 2023-01-31 | 0 | | 2023-02-01 | 0 | | 2023-02-02 | 1 | 61: self join src/self_join.sql with person as ( select ident, personal || ' ' || family as name from staff ) select left.name, right.name from person as left join person as right limit 10; out/self_join.out | name | name | |--------------|------------------| | Kartik Gupta | Kartik Gupta | | Kartik Gupta | Divit Dhaliwal | | Kartik Gupta | Indrans Sridhar | | Kartik Gupta | Pranay Khanna | | Kartik Gupta | Riaan Dua | | Kartik Gupta | Vedika Rout | | Kartik Gupta | Abram Chokshi | | Kartik Gupta | Romil Kapoor | | Kartik Gupta | Ishaan Ramaswamy | | Kartik Gupta | Nitya Lal | * Join a table to itself + Use as to create aliases for copies of tables to distinguish them + Nothing special about the names left and right * Get all n^2 pairs, including person with themself 62: generate unique pairs src/unique_pairs.sql with person as ( select ident, personal || ' ' || family as name from staff ) select left.name, right.name from person as left join person as right on left.ident < right.ident where left.ident <= 4 and right.ident <= 4; out/unique_pairs.out | name | name | |-----------------|-----------------| | Kartik Gupta | Divit Dhaliwal | | Kartik Gupta | Indrans Sridhar | | Kartik Gupta | Pranay Khanna | | Divit Dhaliwal | Indrans Sridhar | | Divit Dhaliwal | Pranay Khanna | | Indrans Sridhar | Pranay Khanna | * left.ident < right.ident ensures distinct pairs without duplicates * Use left.ident <= 4 and right.ident <= 4 to limit output * Quick check: n(n-1)/2 pairs 63: filter pairs src/filter_pairs.sql with person as ( select ident, personal || ' ' || family as name from staff ), together as ( select left.staff as left_staff, right.staff as right_staff from performed as left join performed as right on left.experiment = right.experiment where left_staff < right_staff ) select left.name as person_1, right.name as person_2 from person as left join person as right join together on left.ident = left_staff and right.ident = right_staff; out/filter_pairs.out | person_1 | person_2 | |-----------------|------------------| | Kartik Gupta | Vedika Rout | | Pranay Khanna | Vedika Rout | | Indrans Sridhar | Romil Kapoor | | Abram Chokshi | Ishaan Ramaswamy | | Pranay Khanna | Vedika Rout | | Kartik Gupta | Abram Chokshi | | Abram Chokshi | Romil Kapoor | | Kartik Gupta | Divit Dhaliwal | | Divit Dhaliwal | Abram Chokshi | | Pranay Khanna | Ishaan Ramaswamy | | Indrans Sridhar | Romil Kapoor | | Kartik Gupta | Ishaan Ramaswamy | | 
Kartik Gupta | Nitya Lal | | Kartik Gupta | Abram Chokshi | | Pranay Khanna | Romil Kapoor | 64: existence and correlated subqueries src/correlated_subquery.sql select name, building from department where exists ( select 1 from staff where dept = department.ident ) order by name; out/correlated_subquery.out | name | building | |-------------------|------------------| | Genetics | Chesson | | Histology | Fashet Extension | | Molecular Biology | Chesson | * Nobody works in Endocrinology * select 1 could equally be select true or any other value * A correlated subquery depends on a value from the outer query + Equivalent to nested loop 65: nonexistence src/nonexistence.sql select name, building from department where not exists ( select 1 from staff where dept = department.ident ) order by name; out/nonexistence.out | name | building | |---------------|----------| | Endocrinology | TGVH | avoiding correlated subqueries src/avoid_correlated_subqueries.sql select distinct department.name as name, department.building as building from department join staff on department.ident = staff.dept order by name; out/avoid_correlated_subqueries.out | name | building | |-------------------|------------------| | Genetics | Chesson | | Histology | Fashet Extension | | Molecular Biology | Chesson | * The join might or might not be faster than the correlated subquery * Hard to find unstaffed departments without either not exists or count and a check for 0 66: lead and lag src/lead_lag.sql with ym_num as ( select strftime('%Y-%m', started) as ym, count(*) as num from experiment group by ym ) select ym, lag(num) over (order by ym) as prev_num, num, lead(num) over (order by ym) as next_num from ym_num order by ym; out/lead_lag.out | ym | prev_num | num | next_num | |---------|----------|-----|----------| | 2023-01 | | 2 | 5 | | 2023-02 | 2 | 5 | 5 | | 2023-03 | 5 | 5 | 1 | | 2023-04 | 5 | 1 | 6 | | 2023-05 | 1 | 6 | 5 | | 2023-06 | 6 | 5 | 3 | | 2023-07 | 5 | 3 | 2 | | 2023-08 | 3 | 2 | 4 | | 2023-09 | 2 | 4 | 6 | | 2023-10 | 4 | 6 | 4 | | 2023-12 | 6 | 4 | 5 | | 2024-01 | 4 | 5 | 2 | | 2024-02 | 5 | 2 | | * Use strftime to extract year and month + Clumsy, but date/time handling is not SQLite's strong point * Use window functions lead and lag to shift values + Unavailable values are null 67: window functions src/window_functions.sql with ym_num as ( select strftime('%Y-%m', started) as ym, count(*) as num from experiment group by ym ) select ym, num, sum(num) over (order by ym) as num_done, cume_dist() over (order by ym) as progress from ym_num order by ym; out/window_functions.out | ym | num | num_done | progress | |---------|-----|----------|--------------------| | 2023-01 | 2 | 2 | 0.0769230769230769 | | 2023-02 | 5 | 7 | 0.153846153846154 | | 2023-03 | 5 | 12 | 0.230769230769231 | | 2023-04 | 1 | 13 | 0.307692307692308 | | 2023-05 | 6 | 19 | 0.384615384615385 | | 2023-06 | 5 | 24 | 0.461538461538462 | | 2023-07 | 3 | 27 | 0.538461538461538 | | 2023-08 | 2 | 29 | 0.615384615384615 | | 2023-09 | 4 | 33 | 0.692307692307692 | | 2023-10 | 6 | 39 | 0.769230769230769 | | 2023-12 | 4 | 43 | 0.846153846153846 | | 2024-01 | 5 | 48 | 0.923076923076923 | | 2024-02 | 2 | 50 | 1.0 | * sum() over does a running total * cume_dist is fraction of rows seen so far explain another query plan src/explain_window_function.sql explain query plan with ym_num as ( select strftime('%Y-%m', started) as ym, count(*) as num from experiment group by ym ) select ym, num, sum(num) over (order by ym) as num_done, cume_dist() over (order by ym)
as progress from ym_num order by ym; out/explain_window_function.out QUERY PLAN |--CO-ROUTINE (subquery-3) | |--CO-ROUTINE (subquery-4) | | |--CO-ROUTINE ym_num | | | |--SCAN experiment | | | `--USE TEMP B-TREE FOR GROUP BY | | |--SCAN ym_num | | `--USE TEMP B-TREE FOR ORDER BY | `--SCAN (subquery-4) `--SCAN (subquery-3) * Becomes useful...eventually 68: partitioned windows src/partition_window.sql with y_m_num as ( select strftime('%Y', started) as year, strftime('%m', started) as month, count(*) as num from experiment group by year, month ) select year, month, num, sum(num) over (partition by year order by month) as num_done from y_m_num order by year, month; out/partition_window.out | year | month | num | num_done | |------|-------|-----|----------| | 2023 | 01 | 2 | 2 | | 2023 | 02 | 5 | 7 | | 2023 | 03 | 5 | 12 | | 2023 | 04 | 1 | 13 | | 2023 | 05 | 6 | 19 | | 2023 | 06 | 5 | 24 | | 2023 | 07 | 3 | 27 | | 2023 | 08 | 2 | 29 | | 2023 | 09 | 4 | 33 | | 2023 | 10 | 6 | 39 | | 2023 | 12 | 4 | 43 | | 2024 | 01 | 5 | 5 | | 2024 | 02 | 2 | 7 | * partition by creates groups * So this counts experiments started since the beginning of each year 69: blobs src/blob.sql create table images ( name text not null, content blob ); insert into images(name, content) values ("biohazard", readfile("img/biohazard.png")), ("crush", readfile("img/crush.png")), ("fire", readfile("img/fire.png")), ("radioactive", readfile("img/radioactive.png")), ("tripping", readfile("img/tripping.png")); out/blob.out | name | length(content) | |-------------|-----------------| | biohazard | 19629 | | crush | 15967 | | fire | 18699 | | radioactive | 16661 | | tripping | 17208 | * A blob is a binary large object + Bytes in, bytes out... * If you think that's odd, check out Fossil yet another database src/lab_log_db.sh sqlite3 db/lab_log.db src/lab_log_schema.sql .schema out/lab_log_schema.out CREATE TABLE sqlite_sequence(name,seq); CREATE TABLE person( ident integer primary key autoincrement, details text not null ); CREATE TABLE machine( ident integer primary key autoincrement, name text not null, details text not null ); CREATE TABLE usage( ident integer primary key autoincrement, log text not null ); 70: store JSON src/json_in_table.sql select * from machine; out/json_in_table.out | ident | name | details | |-------|----------------|---------------------------------------------------------| | 1 | WY401 | {"acquired": "2023-05-01"} | | 2 | Inphormex | {"acquired": "2021-07-15", "refurbished": "2023-10-22"} | | 3 | AutoPlate 9000 | {"note": "needs software update"} | * Store heterogeneous data as JSON-formatted text (with double-quoted strings) + Database parses it each time it is queried * Alternatively store as blob + Can't just view it + But more efficient 71: select field from JSON src/json_field.sql select details->'$.acquired' as single_arrow, details->>'$.acquired' as double_arrow from machine; out/json_field.out | single_arrow | double_arrow | |--------------|--------------| | "2023-05-01" | 2023-05-01 | | "2021-07-15" | 2021-07-15 | | | | * Single arrow -> returns a JSON representation of the result * Double arrow ->> returns SQL text, integer, real, or null * Left side is column * Right side is path expression + Start with $ (meaning "root") + Fields separated by .
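One detail worth knowing: a path that names a missing field returns null rather than raising an error, so queries over heterogeneous JSON degrade gracefully. A sketch against the machine table above:

```sql
-- Only Inphormex has a refurbished field; the other rows yield null.
select name, details->>'$.refurbished' as refurbished
from machine;
```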
72: JSON array access src/json_array.sql select ident, json_array_length(log->'$') as length, log->'$[0]' as first from usage; out/json_array.out | ident | length | first | |-------|--------|--------------------------------------------------------------| | 1 | 4 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} | | 2 | 5 | {"machine":"Inphormex","person":["Marianne","Richer"]} | | 3 | 2 | {"machine":"sterilizer","person":["Josette","Villeneuve"]} | | 4 | 1 | {"machine":"sterilizer","person":["Maude","Goulet"]} | | 5 | 2 | {"machine":"AutoPlate 9000","person":["Brigitte","Michaud"]} | | 6 | 1 | {"machine":"sterilizer","person":["Marianne","Richer"]} | | 7 | 3 | {"machine":"WY401","person":["Maude","Goulet"]} | | 8 | 1 | {"machine":"AutoPlate 9000"} | * SQLite (and other database managers) has lots of JSON manipulation functions * json_array_length gives number of elements in selected array * subscripts start with 0 * Characters outside 7-bit ASCII represented as Unicode escapes 73: unpack JSON array src/json_unpack.sql select ident, json_each.key as key, json_each.value as value from usage, json_each(usage.log) limit 10; out/json_unpack.out | ident | key | value | |-------|-----|--------------------------------------------------------------| | 1 | 0 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} | | 1 | 1 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} | | 1 | 2 | {"machine":"WY401","person":["Gabrielle","Dub\u00e9"]} | | 1 | 3 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} | | 2 | 0 | {"machine":"Inphormex","person":["Marianne","Richer"]} | | 2 | 1 | {"machine":"AutoPlate 9000","person":["Marianne","Richer"]} | | 2 | 2 | {"machine":"sterilizer","person":["Marianne","Richer"]} | | 2 | 3 | {"machine":"AutoPlate 9000","person":["Monique","Marcotte"]} | | 2 | 4 | {"machine":"sterilizer","person":["Marianne","Richer"]} | | 3 | 0 | {"machine":"sterilizer","person":["Josette","Villeneuve"]} | * json_each is another table-valued function * Use json_each.name to get properties of unpacked array 74: last element of array src/json_array_last.sql select ident, log->'$[#-1].machine' as final from usage limit 5; out/json_array_last.out | ident | final | |-------|--------------| | 1 | "Inphormex" | | 2 | "sterilizer" | | 3 | "Inphormex" | | 4 | "sterilizer" | | 5 | "sterilizer" | 75: modify JSON src/json_modify.sql select ident, name, json_set(details, '$.sold', json_quote('2024-01-25')) as updated from machine; out/json_modify.out | ident | name | updated | |-------|----------------|--------------------------------------------------------------| | 1 | WY401 | {"acquired":"2023-05-01","sold":"2024-01-25"} | | 2 | Inphormex | {"acquired":"2021-07-15","refurbished":"2023-10-22","sold":" | | | | 2024-01-25"} | | 3 | AutoPlate 9000 | {"note":"needs software update","sold":"2024-01-25"} | * Updates the in-memory copy of the JSON, not the database record * Please use json_quote rather than trying to format JSON with string operations refresh penguins src/count_penguins.sql select species, count(*) as num from penguins group by species; out/count_penguins.out | species | num | |-----------|-----| | Adelie | 152 | | Chinstrap | 68 | | Gentoo | 124 | * We will restore full database after each example 76: tombstones src/make_active.sql alter table penguins add active integer not null default 1; update penguins set active = iif(species = 'Adelie', 0, 1); src/active_penguins.sql select species, count(*) as num from penguins where active group by species; 
out/active_penguins.out | species | num | |-----------|-----| | Chinstrap | 68 | | Gentoo | 124 | * Use a tombstone to mark (in)active records * Every query must now include it 77: views src/views.sql create view if not exists active_penguins ( species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex ) as select species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex from penguins where active; select species, count(*) as num from active_penguins group by species; out/views.out | species | num | |-----------|-----| | Chinstrap | 68 | | Gentoo | 124 | * A view is a saved query that other queries can invoke * View is re-run each time it's used * Like a CTE, but: + Can be shared between queries + Views came first * Some databases offer materialized views + Update-on-demand temporary tables hours reminder src/all_jobs.sql create table job ( name text not null, billable real not null ); insert into job values ('calibrate', 1.5), ('clean', 0.5); select * from job; out/all_jobs.out | name | billable | |-----------|----------| | calibrate | 1.5 | | clean | 0.5 | 78: add check src/all_jobs_check.sql create table job ( name text not null, billable real not null, check (billable > 0.0) ); insert into job values ('calibrate', 1.5); insert into job values ('reset', -0.5); select * from job; out/all_jobs_check.out Runtime error near line 9: CHECK constraint failed: billable > 0.0 (19) | name | billable | |-----------|----------| | calibrate | 1.5 | * check adds constraint to table + Must produce a Boolean result + Run each time values added or modified * But changes made before the error have taken effect ACID * Atomic: change cannot be broken down into smaller ones (i.e., all or nothing) * Consistent: database goes from one consistent state to another * Isolated: looks like changes happened one after another * Durable: if change takes place, it's still there after a restart 79: transactions src/transaction.sql create table job ( name text not null, billable real not null, check (billable > 0.0) ); insert into job values ('calibrate', 1.5); begin transaction; insert into job values ('clean', 0.5); rollback; select * from job; out/transaction.out | name | billable | |-----------|----------| | calibrate | 1.5 | * Statements outside transaction execute and are committed immediately * Statement(s) inside transaction don't take effect until: + end transaction (success) + rollback (undo) * Can have any number of statements inside a transaction * But cannot nest transactions in SQLite + Other databases support this 80: rollback in constraint src/rollback_constraint.sql create table job ( name text not null, billable real not null, check (billable > 0.0) on conflict rollback ); insert into job values ('calibrate', 1.5); insert into job values ('clean', 0.5), ('reset', -0.5); select * from job; out/rollback_constraint.out Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19) | name | billable | |-----------|----------| | calibrate | 1.5 | * All of second insert rolled back as soon as error occurred * But first insert took effect 81: rollback in statement src/rollback_statement.sql create table job ( name text not null, billable real not null, check (billable > 0.0) ); insert or rollback into job values ('calibrate', 1.5); insert or rollback into job values ('clean', 0.5), ('reset', -0.5); select * from job; out/rollback_statement.out Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19) | name | billable | 
|-----------|----------| | calibrate | 1.5 | * Constraint is in table definition * Action is in statement 82: upsert src/upsert.sql create table jobs_done ( person text unique, num integer default 0 ); insert into jobs_done values("zia", 1); .print "after first" select * from jobs_done; .print insert into jobs_done values("zia", 1); .print "after failed" select * from jobs_done; insert into jobs_done values("zia", 1) on conflict(person) do update set num = num + 1; .print "\nafter upsert" select * from jobs_done; out/upsert.out after first | person | num | |--------|-----| | zia | 1 | Runtime error near line 14: UNIQUE constraint failed: jobs_done.person (19) after failed | person | num | |--------|-----| | zia | 1 | after upsert | person | num | |--------|-----| | zia | 2 | * upsert stands for "update or insert" + Create if record doesn't exist + Update if it does * Not standard SQL but widely implemented * Example also shows use of SQLite .print command normalization * First normal form (1NF): every field of every record contains one indivisible value. * Second normal form (2NF) and third normal form (3NF): every value in a record that isn't a key depends solely on the key, not on other values. * Denormalization: explicitly store values that could be calculated on the fly + To simplify queries and/or make processing faster 83: create trigger * A trigger automatically runs before or after a specified operation * Can have side effects (e.g., update some other table) * And/or implement checks (e.g., make sure other records exist) * Add processing overhead... * ...but data is either cheap or correct, never both * Inside trigger, refer to old and new versions of record as old.column and new.column src/trigger_setup.sql -- Track hours of lab work. create table job ( person text not null, reported real not null check (reported >= 0.0) ); -- Explicitly store per-person total rather than using sum(). create table total ( person text unique not null, hours real ); -- Initialize totals. insert into total values ("gene", 0.0), ("august", 0.0); -- Define a trigger. create trigger total_trigger before insert on job begin -- Check that the person exists. select case when not exists (select 1 from total where person = new.person) then raise(rollback, 'Unknown person ') end; -- Update their total hours (or fail if non-negative constraint violated).
update total set hours = hours + new.reported where total.person = new.person; end; src/trigger_successful.sql insert into job values ('gene', 1.5), ('august', 0.5), ('gene', 1.0) ; out/trigger_successful.out | person | reported | |--------|----------| | gene | 1.5 | | august | 0.5 | | gene | 1.0 | | person | hours | |--------|-------| | gene | 2.5 | | august | 0.5 | trigger firing src/trigger_firing.sql insert into job values ('gene', 1.0), ('august', -1.0) ; out/trigger_firing.out Runtime error near line 6: CHECK constraint failed: reported >= 0.0 (19) | person | hours | |--------|-------| | gene | 0.0 | | august | 0.0 | represent graphs src/lineage_setup.sql create table lineage ( parent text not null, child text not null ); insert into lineage values ('Arturo', 'Clemente'), ('Dario', 'Clemente'), ('Clemente', 'Homero'), ('Clemente', 'Ivonne'), ('Ivonne', 'Lourdes'), ('Soledad', 'Lourdes'), ('Lourdes', 'Santiago'); src/represent_graph.sql select * from lineage; out/represent_graph.out | parent | child | |----------|----------| | Arturo | Clemente | | Dario | Clemente | | Clemente | Homero | | Clemente | Ivonne | | Ivonne | Lourdes | | Soledad | Lourdes | | Lourdes | Santiago | lineage diagram 84: recursive query src/recursive_lineage.sql with recursive descendent as ( select 'Clemente' as person, 0 as generations union all select lineage.child as person, descendent.generations + 1 as generations from descendent join lineage on descendent.person = lineage.parent ) select person, generations from descendent; out/recursive_lineage.out | person | generations | |----------|-------------| | Clemente | 0 | | Homero | 1 | | Ivonne | 1 | | Lourdes | 2 | | Santiago | 3 | * Use a recursive CTE to create a temporary table (descendent) * Base case seeds this table * Recursive case relies on value(s) already in that table and external table(s) * union all to combine rows + Can use union but that has lower performance (must check uniqueness each time) * Stops when the recursive case yields an empty row set (nothing new to add) * Then select the desired values from the CTE contact tracing database src/contact_person.sql select * from person; out/contact_person.out | ident | name | |-------|-----------------------| | 1 | Juana Baeza | | 2 | Agustin Rodriquez | | 3 | Ariadna Caraballo | | 4 | Micaela Laboy | | 5 | Veronica Altamirano | | 6 | Reina Rivero | | 7 | Elias Merino | | 8 | Minerva Guerrero | | 9 | Mauro Balderas | | 10 | Pilar Alarcon | | 11 | Daniela Menendez | | 12 | Marco Antonio Barrera | | 13 | Cristal Soliz | | 14 | Bernardo Narvaez | | 15 | Oscar Barrios | src/contact_contacts.sql select * from contact; out/contact_contacts.out | left | right | |-------------------|-----------------------| | Agustin Rodriquez | Ariadna Caraballo | | Agustin Rodriquez | Veronica Altamirano | | Juana Baeza | Veronica Altamirano | | Juana Baeza | Micaela Laboy | | Pilar Alarcon | Reina Rivero | | Cristal Soliz | Marco Antonio Barrera | | Cristal Soliz | Daniela Menendez | | Daniela Menendez | Marco Antonio Barrera | contact diagram 85: bidirectional contacts src/bidirectional.sql create temporary table bi_contact ( left text, right text ); insert into bi_contact select left, right from contact union all select right, left from contact ; out/bidirectional.out | original_count | |----------------| | 8 | | num_contact | |-------------| | 16 | * Create a temporary table rather than using a long chain of CTEs + Only lasts as long as the session (not saved to disk) * Duplicate information rather than writing
86: update group identifiers
src/update_group_ids.sql
select
    left.name as left_name,
    left.ident as left_ident,
    right.name as right_name,
    right.ident as right_ident,
    min(left.ident, right.ident) as new_ident
from (person as left join bi_contact on left.name = bi_contact.left)
join person as right on bi_contact.right = right.name;
out/update_group_ids.out
| left_name | left_ident | right_name | right_ident | new_ident |
|-----------------------|------------|-----------------------|-------------|-----------|
| Juana Baeza | 1 | Micaela Laboy | 4 | 1 |
| Juana Baeza | 1 | Veronica Altamirano | 5 | 1 |
| Agustin Rodriquez | 2 | Ariadna Caraballo | 3 | 2 |
| Agustin Rodriquez | 2 | Veronica Altamirano | 5 | 2 |
| Ariadna Caraballo | 3 | Agustin Rodriquez | 2 | 2 |
| Micaela Laboy | 4 | Juana Baeza | 1 | 1 |
| Veronica Altamirano | 5 | Agustin Rodriquez | 2 | 2 |
| Veronica Altamirano | 5 | Juana Baeza | 1 | 1 |
| Reina Rivero | 6 | Pilar Alarcon | 10 | 6 |
| Pilar Alarcon | 10 | Reina Rivero | 6 | 6 |
| Daniela Menendez | 11 | Cristal Soliz | 13 | 11 |
| Daniela Menendez | 11 | Marco Antonio Barrera | 12 | 11 |
| Marco Antonio Barrera | 12 | Cristal Soliz | 13 | 12 |
| Marco Antonio Barrera | 12 | Daniela Menendez | 11 | 11 |
| Cristal Soliz | 13 | Daniela Menendez | 11 | 11 |
| Cristal Soliz | 13 | Marco Antonio Barrera | 12 | 12 |
* new_ident is minimum of own identifier and identifiers one step away
* Doesn't keep people with no contacts
87: recursive labeling
src/recursive_labeling.sql
with recursive labeled as (
    select person.name as name, person.ident as label
    from person
    union -- not 'union all'
    select person.name as name, labeled.label as label
    from (person join bi_contact on person.name = bi_contact.left)
    join labeled on bi_contact.right = labeled.name
    where labeled.label < person.ident
)
select name, min(label) as group_id
from labeled
group by name
order by label, name;
out/recursive_labeling.out
| name | group_id |
|-----------------------|----------|
| Agustin Rodriquez | 1 |
| Ariadna Caraballo | 1 |
| Juana Baeza | 1 |
| Micaela Laboy | 1 |
| Veronica Altamirano | 1 |
| Pilar Alarcon | 6 |
| Reina Rivero | 6 |
| Elias Merino | 7 |
| Minerva Guerrero | 8 |
| Mauro Balderas | 9 |
| Cristal Soliz | 11 |
| Daniela Menendez | 11 |
| Marco Antonio Barrera | 11 |
| Bernardo Narvaez | 14 |
| Oscar Barrios | 15 |
* Use union instead of union all to prevent infinite recursion
88: query from Python
src/basic_python_query.py
import sqlite3

connection = sqlite3.connect("db/penguins.db")
cursor = connection.execute("select count(*) from penguins;")
rows = cursor.fetchall()
print(rows)
out/basic_python_query.out
[(344,)]
* sqlite3 is part of Python's standard library
* Create a connection to a database file
* Get a cursor by executing a query
+ More common to create cursor and use that to run queries
* Fetch all rows at once as list of tuples
89: incremental fetch
src/incremental_fetch.py
import sqlite3

connection = sqlite3.connect("db/penguins.db")
cursor = connection.cursor()
cursor = cursor.execute("select species, island from penguins limit 5;")
while row := cursor.fetchone():
    print(row)
out/incremental_fetch.out
('Adelie', 'Torgersen')
('Adelie', 'Torgersen')
('Adelie', 'Torgersen')
('Adelie', 'Torgersen')
('Adelie', 'Torgersen')
* cursor.fetchone returns None when no more data
* There is also fetchmany(N) to fetch (up to) a certain number of rows (see the sketch below)
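* A minimal sketch of fetchmany, assuming the same db/penguins.db as above (this is a hypothetical illustration, not one of the tutorial's numbered examples):

import sqlite3

connection = sqlite3.connect("db/penguins.db")
cursor = connection.execute("select species, island from penguins;")
# fetchmany returns an empty list once the results are exhausted,
# so the loop below stops cleanly
while batch := cursor.fetchmany(100):
    print(len(batch), "rows in this batch")

* Since the table has 344 rows, this prints batch sizes of 100, 100, 100, and 44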
90: insert, delete, and all that
src/insert_delete.py
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table example(num integer);")
cursor.execute("insert into example values (10), (20);")
print("after insertion", cursor.execute("select * from example;").fetchall())
cursor.execute("delete from example where num < 15;")
print("after deletion", cursor.execute("select * from example;").fetchall())
out/insert_delete.out
after insertion [(10,), (20,)]
after deletion [(20,)]
* sqlite3 automatically wraps data-changing statements in a transaction; call connection.commit() to save changes permanently (not needed here because the database is in memory and all queries run on the same connection)
91: interpolate values
src/interpolate.py
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table example(num integer);")
cursor.executemany("insert into example values (?);", [(10,), (20,)])
print("after insertion", cursor.execute("select * from example;").fetchall())
out/interpolate.out
after insertion [(10,), (20,)]
* Interpolate values with ? placeholders rather than string formatting to avoid SQL injection
* From XKCD's Exploits of a Mom
92: script execution
src/script_execution.py
import sqlite3

SETUP = """\
drop table if exists example;
create table example(num integer);
insert into example values (10), (20);
"""

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.executescript(SETUP)
print("after insertion", cursor.execute("select * from example;").fetchall())
out/script_execution.out
after insertion [(10,), (20,)]
* But what if something goes wrong?
93: SQLite exceptions in Python
src/exceptions.py
import sqlite3

SETUP = """\
create table example(num integer check(num > 0));
insert into example values (10);
insert into example values (-1);
insert into example values (20);
"""

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
try:
    cursor.executescript(SETUP)
except sqlite3.Error as exc:
    print(f"SQLite exception: {exc}")
print("after execution", cursor.execute("select * from example;").fetchall())
out/exceptions.out
SQLite exception: CHECK constraint failed: num > 0
after execution [(10,)]
94: Python in SQLite
src/embedded_python.py
import sqlite3

SETUP = """\
create table example(num integer);
insert into example values (-10), (10), (20), (30);
"""

def clip(value):
    if value < 0:
        return 0
    if value > 20:
        return 20
    return value

connection = sqlite3.connect(":memory:")
connection.create_function("clip", 1, clip)
cursor = connection.cursor()
cursor.executescript(SETUP)
for row in cursor.execute("select num, clip(num) from example;").fetchall():
    print(row)
out/embedded_python.out
(-10, 0)
(10, 10)
(20, 20)
(30, 20)
* SQLite calls back into Python to execute the function
* Other databases can run Python (and other languages) in the database server process
* Be careful
* sqlite3 can also register custom aggregate functions (see the sketch below)
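* connection.create_function registers a scalar function; the standard library also provides connection.create_aggregate for custom aggregations. A minimal sketch, using a hypothetical spread aggregate (max minus min) that is not part of the tutorial's examples:

import sqlite3

class Spread:
    # accumulate the minimum and maximum of the values seen so far
    def __init__(self):
        self.lo = None
        self.hi = None

    def step(self, value):
        if value is not None:
            self.lo = value if self.lo is None else min(self.lo, value)
            self.hi = value if self.hi is None else max(self.hi, value)

    def finalize(self):
        return None if self.lo is None else self.hi - self.lo

connection = sqlite3.connect(":memory:")
connection.create_aggregate("spread", 1, Spread)
cursor = connection.cursor()
cursor.executescript("""
create table example(num integer);
insert into example values (-10), (10), (20), (30);
""")
print(cursor.execute("select spread(num) from example;").fetchone())

* SQLite makes one instance of the class per aggregation, calls step once per row, and calls finalize to get the result, so this prints (40,)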
95: handle dates and times
src/dates_times.py
from datetime import date
import sqlite3

# Convert date to ISO-formatted string when writing to database
def _adapt_date_iso(val):
    return val.isoformat()

sqlite3.register_adapter(date, _adapt_date_iso)

# Convert ISO-formatted string to date when reading from database
def _convert_date(val):
    return date.fromisoformat(val.decode())

sqlite3.register_converter("date", _convert_date)

SETUP = """\
create table events(
    happened date not null,
    description text not null
);
"""

connection = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
cursor = connection.cursor()
cursor.execute(SETUP)
cursor.executemany(
    "insert into events values (?, ?);",
    [(date(2024, 1, 10), "started tutorial"), (date(2024, 1, 29), "finished tutorial")],
)
for row in cursor.execute("select * from events;").fetchall():
    print(row)
out/dates_times.out
(datetime.date(2024, 1, 10), 'started tutorial')
(datetime.date(2024, 1, 29), 'finished tutorial')
* sqlite3.PARSE_DECLTYPES tells the sqlite3 library to use converters based on declared column types
* Adapt on the way in, convert on the way out
96: SQL in Jupyter notebooks
src/install_jupysql.sh
pip install jupysql
* And then inside the notebook:
src/load_ext.text
%load_ext sql
* Loads extension
src/jupyter_connect.text
%sql sqlite:///data/penguins.db
out/jupyter_connect.out
Connecting to 'sqlite:///data/penguins.db'
* Connects to database
+ sqlite:// with two slashes is the protocol
+ /data/penguins.db (one leading slash) is a local path
* Single percent sign %sql introduces one-line command
* Use double percent sign %%sql to indicate that the rest of the cell is SQL
src/jupyter_select.text
%%sql
select species, count(*) as num from penguins group by species;
out/jupyter_select.out
Running query in 'sqlite:///data/penguins.db'
species num
Adelie 152
Chinstrap 68
Gentoo 124
97: Pandas and SQL
src/install_pandas.sh
pip install pandas
src/select_pandas.py
import pandas as pd
import sqlite3

connection = sqlite3.connect("db/penguins.db")
query = "select species, count(*) as num from penguins group by species;"
df = pd.read_sql(query, connection)
print(df)
out/select_pandas.out
species num
0 Adelie 152
1 Chinstrap 68
2 Gentoo 124
* Be careful about datatype conversion (see the sketch below)
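* As a quick check on that warning, a hypothetical sketch (not one of the tutorial's examples) that prints the column types pandas infers, assuming the same db/penguins.db:

import pandas as pd
import sqlite3

connection = sqlite3.connect("db/penguins.db")
df = pd.read_sql("select species, body_mass_g from penguins;", connection)
# a column that contains nulls may come back as float64 with NaN,
# not as an integer column with null
print(df.dtypes)

* In general, pandas has no integer-with-missing-values column type by default, so SQL nulls can silently change a column's type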
98: Polars and SQL
src/install_polars.sh
pip install polars pyarrow adbc-driver-sqlite
src/select_polars.py
import polars as pl

query = "select species, count(*) as num from penguins group by species;"
uri = "sqlite:///db/penguins.db"
df = pl.read_database_uri(query, uri, engine="adbc")
print(df)
out/select_polars.out
shape: (3, 2)
+-----------+-----+
| species | num |
| --- | --- |
| str | i64 |
+-----------+-----+
| Adelie | 152 |
| Chinstrap | 68 |
| Gentoo | 124 |
+-----------+-----+
* The Uniform Resource Identifier (URI) specifies the database
* The query is the query
* Use the ADBC engine instead of the default ConnectorX
99: object-relational mapper
src/orm.py
from sqlmodel import Field, Session, SQLModel, create_engine, select

class Department(SQLModel, table=True):
    ident: str = Field(default=None, primary_key=True)
    name: str
    building: str

engine = create_engine("sqlite:///db/assays.db")
with Session(engine) as session:
    statement = select(Department)
    for result in session.exec(statement).all():
        print(result)
out/orm.out
building='Chesson' name='Genetics' ident='gen'
building='Fashet Extension' name='Histology' ident='hist'
building='Chesson' name='Molecular Biology' ident='mb'
building='TGVH' name='Endocrinology' ident='end'
* An object-relational mapper (ORM) translates table columns to object properties and vice versa
* SQLModel relies on Python type hints
100: relations with ORM
src/orm_relation.py
class Staff(SQLModel, table=True):
    ident: str = Field(default=None, primary_key=True)
    personal: str
    family: str
    dept: Optional[str] = Field(default=None, foreign_key="department.ident")
    age: int

engine = create_engine("sqlite:///db/assays.db")
SQLModel.metadata.create_all(engine)
with Session(engine) as session:
    statement = select(Department, Staff).where(Staff.dept == Department.ident)
    for dept, staff in session.exec(statement):
        print(f"{dept.name}: {staff.personal} {staff.family}")
out/orm_relation.out
Histology: Divit Dhaliwal
Molecular Biology: Indrans Sridhar
Molecular Biology: Pranay Khanna
Histology: Vedika Rout
Genetics: Abram Chokshi
Histology: Romil Kapoor
Molecular Biology: Ishaan Ramaswamy
Genetics: Nitya Lal
* Make foreign keys explicit in class definitions
* SQLModel automatically does the join
+ The two staff with no department aren't included in the result
Appendices
Terms
1-to-1 relation: A relationship between two tables in which each record from the first table matches exactly one record from the second and vice versa.
1-to-many relation: A relationship between two tables in which each record from the first table matches zero or more records from the second, but each record from the second table matches exactly one record from the first.
aggregation: Combining several values to produce one.
aggregation function: A function used to produce one value from many, such as maximum or addition.
alias: An alternate name used temporarily for a table or column.
atomic: An operation that cannot be broken into smaller operations.
autoincrement: Automatically add one to a value.
base case: A starting point for recursion that does not depend on previous recursive calculations.
Binary Large Object (blob): Bytes that are handled as-is rather than being interpreted as numbers, text, or other data types.
common table expression (CTE): A temporary table created at the start of a query, usually to simplify writing the query.
consistent: A state in which all constraints are satisfied, e.g., all columns contain allowed values and all foreign keys refer to primary keys.
correlated subquery: A subquery that depends on a value or values from the enclosing query, and which must therefore be executed once for each of those values.
cross join: A join that creates the cross-product of rows from two tables.
cursor: A reference to the current location in the results of an ongoing query.
data migration: To move data from one form to another, e.g., from one set of tables to a new set or from one DBMS to another.
database: A collection of data that can be searched and retrieved.
database management system (DBMS): A program that manages a particular kind of database.
denormalization: To deliberately introduce duplication or otherwise violate normal forms, typically to improve query performance.
durable: Guaranteed to survive shutdown and restart.
entity-relationship diagram: A graphical depiction of the relationships between tables in a database.
filter: To select records based on whether they pass some Boolean test.
foreign key: A value in one table that identifies a primary key in another table.
full outer join: A join that keeps all rows from both tables, filling missing values with null where rows from one table have no match in the other.
group: A set of records that share a common property, such as having the same value in a particular column.
in-memory database: A database that is stored in memory rather than on disk.
index: An auxiliary data structure that enables faster access to records.
infinite recursion: See "infinite recursion".
isolated: The appearance of having executed in an otherwise-idle system.
join: To combine records from two tables.
join condition: The criteria used to decide which rows from each table in a join are combined.
join table: A table that exists solely to enable information from two tables to be connected.
left outer join: A join that is guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with actual values if available or with null otherwise.
many-to-many relation: A relationship between two tables in which each record from the first table may match zero or more records from the second and vice versa.
materialized view: A view that is stored on disk and updated on demand.
normal form: One of several (loosely defined) rules for organizing data in tables.
NoSQL database: Any database that doesn't use the relational model.
null: A special value representing "not known".
object-relational mapper (ORM): A library that translates objects in a program into database queries and the results of those queries back into objects.
path expression: An expression identifying an element or a set of elements in a JSON structure.
primary key: A value or values in a database table that uniquely identifies each record in that table.
query: A command to perform some operation in a database (typically data retrieval).
recursive case: The second or subsequent step in self-referential accumulation of data.
recursive CTE: A common table expression that refers to itself. Every recursive CTE must have a base case and a recursive case.
relational database management system (RDBMS): A database management system that stores data in tables with columns and rows.
subquery: A query used within another query.
table-valued function: A function that returns rows rather than a single value.
temporary table: A table that is explicitly constructed in memory outside any particular query.
ternary logic: A logic based on three values: true, false, and "don't know" (represented as null).
tombstone: A marker value added to a record to show that it is no longer active. Tombstones are used as an alternative to deleting data.
trigger: An action that runs automatically when something happens in a database, typically insertion or deletion.
Uniform Resource Identifier (URI): A string that identifies a resource (such as a web page or database) and the protocol used to access it.
upsert: To update a record if it exists or insert (create) a new record if it doesn't.
view: A rearrangement of data in a database that is regenerated on demand.
window function: A function that combines data from adjacent rows in a database query's result.
Acknowledgments
This tutorial would not have been possible without:
* Andi Albrecht's sqlparse module
* Dimitri Fontaine's The Art of PostgreSQL
* David Rozenshtein's The Essence of SQL (now sadly out of print)
I would also like to thank the following for spotting issues, making suggestions, or submitting changes:
* Sam Hames
* Robert Kern
* Roy Pardee
* Manos Pitsidianakis
* Daniel Possenriede
* Adam Rosien
* Thomas Sandmann
2024 Greg Wilson