https://gvwilson.github.io/sql-tutorial/ SQL for Data Scientists in 100 Queries what this is * notes and working examples that instructors can use to perform a lesson + do not expect novices with no prior SQL experience to be able to learn from them * musical analogy + this is the chord changes and melody + we expect instructors to create an arrangement and/or improvise while delivering + see Teaching Tech Together for background scope * intended audience + Rachel has a master's degree in cell biology and now works in a research hospital doing cell assays. + She learned a bit of R in an undergrad biostatistics course and has been through the Carpentries lesson on the Unix shell. + Rachel is thinking about becoming a data scientist and would like to understand how data is stored and managed. + Her work schedule is unpredictable and highly variable, so she needs to be able to learn a bit at a time. * prerequisites + basic Unix command line: cd, ls, * wildcard + basic tabular data analysis: filtering rows, aggregating within groups * learning outcomes 1. Explain the difference between a database and a database manager. 2. Write SQL to select, filter, sort, group, and aggregate data. 3. Define tables and insert, update, and delete records. 4. Describe different types of join and write queries that use them to combine data. 5. Use windowing functions to operate on adjacent rows. 6. Explain what transactions are and write queries that roll back when constraints are violated. 7. Explain what triggers are and write SQL to create them. 8. Manipulate JSON data using SQL. 9. Interact with a database using Python directly, from a Jupyter notebook, and via an ORM. setup * Download the latest release * Unzip the file in a temporary directory to create: + ./db/*.db: the SQLite databases used in the examples + ./src/*.*: SQL queries, Python scripts, and other source code + ./out/*.*: expected output for examples background concepts * A database is a collection of data that can be searched and retrieved * A database management system (DBMS) is a program that manages a particular kind of database * Each DBMS stores data in its own way + SQLite stores each database in a single file + PostgreSQL spreads information across many files for higher performance * A DBMS can be a library embedded in other programs (SQLite) or a server (PostgreSQL) * A relational database management system (RDBMS) stores data in tables and uses SQL for queries + Unfortunately, every RDBMS has its own dialect of SQL * There are also NoSQL databases like MongoDB that don't use tables concept map: overview connect to database src/connect_penguins.sh sqlite3 db/penguins.db * Not actually a query * But we have to do it before we can do anything else 1: select constant src/select_1.sql select 1; out/select_1.out 1 * select is a keyword * Normally used to select data from a table...
* ...but if all we want is a constant value, we don't need to specify one * Semi-colon terminator is required 2: select all values from table src/select_star.sql select * from little_penguins; out/select_star.out Adelie|Dream|37.2|18.1|178|3900|MALE Adelie|Dream|37.6|19.3|181|3300|FEMALE Gentoo|Biscoe|50|15.3|220|5550|MALE Adelie|Torgersen|37.3|20.5|199|3775|MALE Adelie|Biscoe|39.6|17.7|186|3500|FEMALE Gentoo|Biscoe|47.7|15|216|4750|FEMALE Adelie|Dream|36.5|18|182|3150|FEMALE Gentoo|Biscoe|42|13.5|210|4150|FEMALE Adelie|Torgersen|42.1|19.1|195|4000|MALE Gentoo|Biscoe|54.3|15.7|231|5650|MALE * An actual query * Use * to mean "all columns" * Use from tablename to specify table * Output format is not particularly readable administrative commands src/admin_commands.sql .headers on .mode markdown select * from little_penguins; out/admin_commands.out | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | |---------|-----------|----------------|---------------|-------------------|-------------|--------| | Adelie | Dream | 37.2 | 18.1 | 178 | 3900 | MALE | | Adelie | Dream | 37.6 | 19.3 | 181 | 3300 | FEMALE | | Gentoo | Biscoe | 50 | 15.3 | 220 | 5550 | MALE | | Adelie | Torgersen | 37.3 | 20.5 | 199 | 3775 | MALE | | Adelie | Biscoe | 39.6 | 17.7 | 186 | 3500 | FEMALE | | Gentoo | Biscoe | 47.7 | 15 | 216 | 4750 | FEMALE | | Adelie | Dream | 36.5 | 18 | 182 | 3150 | FEMALE | | Gentoo | Biscoe | 42 | 13.5 | 210 | 4150 | FEMALE | | Adelie | Torgersen | 42.1 | 19.1 | 195 | 4000 | MALE | | Gentoo | Biscoe | 54.3 | 15.7 | 231 | 5650 | MALE | * SQLite administrative commands start with . and aren't part of the SQL standard + PostgreSQL's special commands start with \ * Use .help for a complete list 3: specify columns src/specify_columns.sql select species, island, sex from little_penguins; out/specify_columns.out | species | island | sex | |---------|-----------|--------| | Adelie | Dream | MALE | | Adelie | Dream | FEMALE | | Gentoo | Biscoe | MALE | | Adelie | Torgersen | MALE | | Adelie | Biscoe | FEMALE | | Gentoo | Biscoe | FEMALE | | Adelie | Dream | FEMALE | | Gentoo | Biscoe | FEMALE | | Adelie | Torgersen | MALE | | Gentoo | Biscoe | MALE | * Specify column names separated by commas + In any order + Duplicates allowed * Line breaks allowed and encouraged for readability 4: sort src/sort.sql select species, sex, island from little_penguins order by island asc, sex desc; out/sort.out | species | sex | island | |---------|--------|-----------| | Gentoo | MALE | Biscoe | | Gentoo | MALE | Biscoe | | Adelie | FEMALE | Biscoe | | Gentoo | FEMALE | Biscoe | | Gentoo | FEMALE | Biscoe | | Adelie | MALE | Dream | | Adelie | FEMALE | Dream | | Adelie | FEMALE | Dream | | Adelie | MALE | Torgersen | | Adelie | MALE | Torgersen | * order by must follow from (which must follow select) * asc is ascending, desc is descending + Default is ascending, but please specify 5: limit output * Full dataset has 344 rows src/limit.sql select species, sex, island from penguins order by species, sex, island limit 10; out/limit.out | species | sex | island | |---------|--------|-----------| | Adelie | | Dream | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | * Comments start with -- and run to the end of the line * limit N specifies maximum number of rows returned by query 6: page output
src/page.sql select species, sex, island from penguins order by species, sex, island limit 10 offset 3; out/page.out | species | sex | island | |---------|--------|-----------| | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | | Adelie | FEMALE | Biscoe | * offset N must follow limit * Specifies number of rows to skip from the start of the selection * So this query skips the first 3 and shows the next 10 7: remove duplicates src/distinct.sql select distinct species, sex, island from penguins; out/distinct.out | species | sex | island | |-----------|--------|-----------| | Adelie | MALE | Torgersen | | Adelie | FEMALE | Torgersen | | Adelie | | Torgersen | | Adelie | FEMALE | Biscoe | | Adelie | MALE | Biscoe | | Adelie | FEMALE | Dream | | Adelie | MALE | Dream | | Adelie | | Dream | | Chinstrap | FEMALE | Dream | | Chinstrap | MALE | Dream | | Gentoo | FEMALE | Biscoe | | Gentoo | MALE | Biscoe | | Gentoo | | Biscoe | * distinct keyword must appear right after select + SQL was supposed to read like English * Shows distinct combinations * Blanks in sex column show missing data + We'll talk about this in a bit 8: filter results src/filter.sql select distinct species, sex, island from penguins where island = 'Biscoe'; out/filter.out | species | sex | island | |---------|--------|--------| | Adelie | FEMALE | Biscoe | | Adelie | MALE | Biscoe | | Gentoo | FEMALE | Biscoe | | Gentoo | MALE | Biscoe | | Gentoo | | Biscoe | * where condition filters the rows produced by selection * Condition is evaluated independently for each row * Only rows that pass the test appear in results * Use single quotes for 'text data' and double quotes for "weird column names" + SQLite will accept double-quoted text data 9: filter with more complex conditions src/filter_and.sql select distinct species, sex, island from penguins where island = 'Biscoe' and sex != 'MALE'; out/filter_and.out | species | sex | island | |---------|--------|--------| | Adelie | FEMALE | Biscoe | | Gentoo | FEMALE | Biscoe | * and: both sub-conditions must be true * or: either or both parts must be true * Notice that the row for Gentoo penguins on Biscoe island with unknown (empty) sex didn't pass the test + We'll talk about this in a bit 10: do calculations src/calculations.sql select flipper_length_mm / 10.0, body_mass_g / 1000.0 from penguins limit 3; out/calculations.out | flipper_length_mm / 10.0 | body_mass_g / 1000.0 | |--------------------------|----------------------| | 18.1 | 3.75 | | 18.6 | 3.8 | | 19.5 | 3.25 | * Can do the usual kinds of arithmetic on individual values + Calculation done for each row independently * Column name shows the calculation done 11: rename columns src/rename_columns.sql select flipper_length_mm / 10.0 as flipper_cm, body_mass_g / 1000.0 as weight_kg, island as where_found from penguins limit 3; out/rename_columns.out | flipper_cm | weight_kg | where_found | |------------|-----------|-------------| | 18.1 | 3.75 | Torgersen | | 18.6 | 3.8 | Torgersen | | 19.5 | 3.25 | Torgersen | * Use expression as name to rename * Give result of calculation a meaningful name * Can also rename columns without modifying them check your understanding concept map: selection 12: calculate with missing values src/show_missing_values.sql select flipper_length_mm / 10.0 as flipper_cm, body_mass_g / 1000.0 as weight_kg, island as
where_found from penguins limit 5; out/show_missing_values.out | flipper_cm | weight_kg | where_found | |------------|-----------|-------------| | 18.1 | 3.75 | Torgersen | | 18.6 | 3.8 | Torgersen | | 19.5 | 3.25 | Torgersen | | | | Torgersen | | 19.3 | 3.45 | Torgersen | * SQL uses a special value null to represent missing data + Not 0 or empty string, but "I don't know" * Flipper length and body weight not known for one of the first five penguins * "I don't know" divided by 10 or 1000 is "I don't know" 13: null equality * Repeated from above so it doesn't count against our query limit src/filter.sql select distinct species, sex, island from penguins where island = 'Biscoe'; out/filter.out | species | sex | island | |---------|--------|--------| | Adelie | FEMALE | Biscoe | | Adelie | MALE | Biscoe | | Gentoo | FEMALE | Biscoe | | Gentoo | MALE | Biscoe | | Gentoo | | Biscoe | * If we ask for female penguins the row with the missing sex drops out src/null_equality.sql select distinct species, sex, island from penguins where island = 'Biscoe' and sex = 'FEMALE'; out/null_equality.out | species | sex | island | |---------|--------|--------| | Adelie | FEMALE | Biscoe | | Gentoo | FEMALE | Biscoe | 14: null inequality * But if we ask for penguins that aren't female it drops out as well src/null_inequality.sql select distinct species, sex, island from penguins where island = 'Biscoe' and sex != 'FEMALE'; out/null_inequality.out | species | sex | island | |---------|------|--------| | Adelie | MALE | Biscoe | | Gentoo | MALE | Biscoe | 15: ternary logic src/ternary_logic.sql select null = null; out/ternary_logic.out | null = null | |-------------| | | * If we don't know the left and right values, we don't know if they're equal or not * So the result is null * Get the same answer for null != null * Truth table for ternary equality: | equality | X | Y | null | |----------|-------|-------|------| | X | true | false | null | | Y | false | true | null | | null | null | null | null | 16: handle null safely src/safe_null_equality.sql select species, sex, island from penguins where sex is null; out/safe_null_equality.out | species | sex | island | |---------|-----|-----------| | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Torgersen | | Adelie | | Dream | | Gentoo | | Biscoe | | Gentoo | | Biscoe | | Gentoo | | Biscoe | | Gentoo | | Biscoe | | Gentoo | | Biscoe | * Use is null and is not null to handle null safely * Other parts of SQL handle nulls specially check your understanding concept map: null 17: aggregate src/simple_sum.sql select sum(body_mass_g) as total_mass from penguins; out/simple_sum.out | total_mass | |------------| | 1437000 | * Aggregation combines many values to produce one * sum is an aggregation function * Combines corresponding values from multiple rows 18: common aggregation functions src/common_aggregations.sql select max(bill_length_mm) as longest_bill, min(flipper_length_mm) as shortest_flipper, avg(bill_length_mm) / avg(bill_depth_mm) as weird_ratio from penguins; out/common_aggregations.out | longest_bill | shortest_flipper | weird_ratio | |--------------|------------------|------------------| | 59.6 | 172 | 2.56087082530644 | * This actually shouldn't work: can't calculate maximum or average if any values are null * SQL does the useful thing instead of the right one 19: counting src/count_behavior.sql select count(*) as count_star, count(sex) as count_specific from penguins; out/count_behavior.out | count_star | count_specific | |------------|----------------| | 344 | 333 | * count(*) counts rows *
count(column) counts non-null entries in column 20: group src/simple_group.sql select avg(body_mass_g) as average_mass_g from penguins group by sex; out/simple_group.out | average_mass_g | |------------------| | 4005.55555555556 | | 3862.27272727273 | | 4545.68452380952 | * Put rows in groups based on distinct combinations of values in columns specified with group by * Then perform aggregation separately for each group * But which is which? 21: behavior of unaggregated columns src/unaggregated_columns.sql select sex, avg(body_mass_g) as average_mass_g from penguins group by sex; out/unaggregated_columns.out | sex | average_mass_g | |--------|------------------| | | 4005.55555555556 | | FEMALE | 3862.27272727273 | | MALE | 4545.68452380952 | * All rows in each group have the same value for sex, so no need to aggregate 22: arbitrary choice in aggregation src/arbitrary_in_aggregation.sql select sex, body_mass_g from penguins group by sex; out/arbitrary_in_aggregation.out | sex | body_mass_g | |--------|-------------| | | | | FEMALE | 3800 | | MALE | 3750 | * If we don't specify how to aggregate a column, SQL can choose any arbitrary value from the group * All penguins in each group have the same sex because we grouped by that, so we get the right answer * The body mass values are in the data but unpredictable * A common mistake 23: filter aggregated values src/filter_aggregation.sql select sex, avg(body_mass_g) as average_mass_g from penguins group by sex having average_mass_g > 4000.0; out/filter_aggregation.out | sex | average_mass_g | |------|------------------| | | 4005.55555555556 | | MALE | 4545.68452380952 | * Using having condition instead of where condition for aggregates 24: readable output src/readable_aggregation.sql select sex, round(avg(body_mass_g), 1) as average_mass_g from penguins group by sex having average_mass_g > 4000.0; out/readable_aggregation.out | sex | average_mass_g | |------|----------------| | | 4005.6 | | MALE | 4545.7 | * Use round(value, decimals) to round off a number 25: filter aggregate inputs src/filter_aggregate_inputs.sql select sex, round( avg(body_mass_g) filter (where body_mass_g < 4000.0), 1 ) as average_mass_g from penguins group by sex; out/filter_aggregate_inputs.out | sex | average_mass_g | |--------|----------------| | | 3362.5 | | FEMALE | 3417.3 | | MALE | 3752.5 | * filter (where condition) applies to inputs check your understanding concept map: aggregation create in-memory database src/in_memory_db.sh sqlite3 :memory: * "Connect" to an in-memory database 26: create tables src/create_work_job.sql create table job ( name text not null, billable real not null ); create table work ( person text not null, job text not null ); * create table name followed by parenthesized list of columns * Each column is a name, a data type, and optional extra information + E.g., not null prevents nulls from being added * .schema is not standard SQL * SQLite has added a few things + create if not exists + upper-case keywords (SQL is case insensitive) 27: insert data src/populate_work_job.sql insert into job values ('calibrate', 1.5), ('clean', 0.5); insert into work values ('mik', 'calibrate'), ('mik', 'clean'), ('mik', 'complain'), ('po', 'clean'), ('po', 'complain'), ('tay', 'complain'); out/insert_values.out | name | billable | |-----------|----------| | calibrate | 1.5 | | clean | 0.5 | | person | job | |--------|-----------| | mik | calibrate | | mik | clean | | mik | complain | | po | clean | | po | complain | | tay | complain | 28: update rows 
src/update_work_job.sql update work set person = "tae" where person = "tay"; out/update_rows.out | person | job | |--------|-----------| | mik | calibrate | | mik | clean | | mik | complain | | po | clean | | po | complain | | tae | complain | * (Almost) always specify row(s) to update using where + Would otherwise update all rows * Useful to give each row a primary key that uniquely identifies it for this purpose + Will see other uses below 29: delete rows src/delete_rows.sql delete from work where person = "tae"; select * from work; out/delete_rows.out | person | job | |--------|-----------| | mik | calibrate | | mik | clean | | mik | complain | | po | clean | | po | complain | * Again, (almost) always specify row(s) to delete using where 30: backing up src/backing_up.sql create table backup ( person text not null, job text not null ); insert into backup select person, job from work where person = 'tae'; delete from work where person = 'tae'; select * from backup; out/backing_up.out | person | job | |--------|----------| | tae | complain | check your understanding concept map: data definition and modification 31: join tables src/cross_join.sql select * from work cross join job; out/cross_join.out | person | job | name | billable | |--------|-----------|-----------|----------| | mik | calibrate | calibrate | 1.5 | | mik | calibrate | clean | 0.5 | | mik | clean | calibrate | 1.5 | | mik | clean | clean | 0.5 | | mik | complain | calibrate | 1.5 | | mik | complain | clean | 0.5 | | po | clean | calibrate | 1.5 | | po | clean | clean | 0.5 | | po | complain | calibrate | 1.5 | | po | complain | clean | 0.5 | | tay | complain | calibrate | 1.5 | | tay | complain | clean | 0.5 | * A join combines information from two tables * cross join constructs their cross product + All combinations of rows from each * Result isn't particularly useful: job and name don't match 32: inner join src/inner_join.sql select * from work inner join job on work.job = job.name; out/inner_join.out | person | job | name | billable | |--------|-----------|-----------|----------| | mik | calibrate | calibrate | 1.5 | | mik | clean | clean | 0.5 | | po | clean | clean | 0.5 | * Use table.column notation to specify columns + A column can have the same name as a table * Use on condition to specify join condition * Since complain doesn't appear in job.name, none of those rows are kept 33: aggregate joined data src/aggregate_join.sql select work.person, sum(job.billable) as pay from work inner join job on work.job = job.name group by work.person; out/aggregate_join.out | person | pay | |--------|-----| | mik | 2.0 | | po | 0.5 | * Combines ideas we've seen before * But Tay is missing from the table 34: left join src/left_join.sql select * from work left join job on work.job = job.name; out/left_join.out | person | job | name | billable | |--------|-----------|-----------|----------| | mik | calibrate | calibrate | 1.5 | | mik | clean | clean | 0.5 | | mik | complain | | | | po | clean | clean | 0.5 | | po | complain | | | | tay | complain | | | * A left outer join keeps all rows from the left table * Fills missing values from right table with null 35: aggregate left joins src/aggregate_left_join.sql select work.person, sum(job.billable) as pay from work left join job on work.job = job.name group by work.person; out/aggregate_left_join.out | person | pay | |--------|-----| | mik | 2.0 | | po | 0.5 | | tay | | * That's better, but we'd like to see 0 rather than a blank check your understanding
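As an aside, left join has mirror images. SQLite 3.39 and later also accept right and full outer joins; earlier releases reject the syntax, so treat this as a version-dependent sketch rather than part of the lesson's canonical examples:

```sql
-- right join keeps all rows from the right table (job) instead of work;
-- requires SQLite 3.39 or later.
select *
from work right join job on work.job = job.name;

-- full join keeps unmatched rows from *both* tables, padding with null.
select *
from work full join job on work.job = job.name;
```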
concept map: join 36: coalesce values src/coalesce.sql select work.person, coalesce(sum(job.billable), 0.0) as pay from work left join job on work.job = job.name group by work.person; out/coalesce.out | person | pay | |--------|-----| | mik | 2.0 | | po | 0.5 | | tay | 0.0 | * coalesce(val1, val2, ...) returns first non-null value 37: negate incorrectly * Who doesn't calibrate? src/negate_incorrectly.sql select distinct person from work where job != 'calibrate'; out/negate_incorrectly.out | person | |--------| | mik | | po | | tay | * But Mik does calibrate * Problem is that there's an entry for Mik cleaning * And since 'clean' != 'calibrate', that row is included in the results * We need a different approach 38: set membership src/set_membership.sql select * from work where person not in ('mik', 'tay'); out/set_membership.out | person | job | |--------|----------| | po | clean | | po | complain | * in values and not in values do exactly what you expect 39: subqueries src/subquery_set.sql select distinct person from work where person not in ( select distinct person from work where job = 'calibrate' ); out/subquery_set.out | person | |--------| | po | | tay | * Use a subquery to select the people who do calibrate * Then select all the people who aren't in that set * Initially feels odd, but subqueries are useful in other ways M to N relationships * Relationships between entities are usually characterized as: + 1-to-1: fields in the same record + 1-to-many: the many have a foreign key referring to the one's primary key + many-to-many: don't know how many keys to add to records ("maximum" never is) * Nearly-universal solution is a join table + Each record is a pair of foreign keys + I.e., each record is the fact that records A and B are related 40: autoincrement and primary key src/autoincrement.sql create table person ( ident integer primary key autoincrement, name text not null ); insert into person values (null, 'mik'), (null, 'po'), (null, 'tay'); select * from person; insert into person values (1, "prevented"); out/autoincrement.out | ident | name | |-------|------| | 1 | mik | | 2 | po | | 3 | tay | Runtime error near line 12: UNIQUE constraint failed: person.ident (19) * Database autoincrements ident each time a new record is added * Use that field as the primary key + So that if Mik changes their name again, we only have to change one fact in the database + Downside: manual queries are harder to read (who is person 17?) 
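A common alternative to supplying null for the autoincremented column is to name the inserted columns explicitly. A small sketch (the name 'asha' is hypothetical and this is not run against the lesson database, so the outputs below are unaffected):

```sql
-- Naming the columns lets the database fill in ident automatically.
insert into person(name) values ('asha');
-- The new row gets the next sequence value (4, continuing from tay's 3).
select * from person where name = 'asha';
```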
internal tables src/sequence_table.sql select * from sqlite_sequence; out/sequence_table.out | name | seq | |--------|-----| | person | 3 | * Sequence numbers are not reset when rows are deleted 41: alter tables src/alter_tables.sql alter table job add ident integer not null default -1; update job set ident = 1 where name = 'calibrate'; update job set ident = 2 where name = 'clean'; select * from job; out/alter_tables.out | name | billable | ident | |-----------|----------|-------| | calibrate | 1.5 | 1 | | clean | 0.5 | 2 | * Add a column after the fact * Since it can't be null, we have to provide a default value + Really want to make it the primary key, but SQLite doesn't allow that (easily) after the fact * Then use update to modify existing records + Can modify any number of records at once + So be careful about the where clause * Changes like this are called data migration 42: create new tables from old src/insert_select.sql create table new_work ( person_id integer not null, job_id integer not null, foreign key(person_id) references person(ident), foreign key(job_id) references job(ident) ); insert into new_work select person.ident as person_id, job.ident as job_id from (person join work on person.name = work.person) join job on job.name = work.job; select * from new_work; out/insert_select.out | person_id | job_id | |-----------|--------| | 1 | 1 | | 1 | 2 | | 2 | 2 | * new_work is our join table * Each column refers to a record in some other table 43: remove tables src/drop_table.sql drop table work; alter table new_work rename to work; out/drop_table.out CREATE TABLE job ( ident integer primary key autoincrement, name text not null, billable real not null ); CREATE TABLE sqlite_sequence(name,seq); CREATE TABLE person ( ident integer primary key autoincrement, name text not null ); CREATE TABLE IF NOT EXISTS "work" ( person_id integer not null, job_id integer not null, foreign key(person_id) references person(ident), foreign key(job_id) references job(ident) ); * Remove the old table and rename the new one to take its place + Note the if not exists in the dumped schema * Be careful...
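The if exists / if not exists modifiers seen in that schema dump make scripts like this safe to re-run. A minimal sketch of the idiom (illustrative only, not part of the migration above):

```sql
-- No error even if the table is already gone, so the script is idempotent.
drop table if exists scratch;

-- Likewise, creation can be guarded so re-running the script is harmless.
create table if not exists scratch (
    value integer not null
);
```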
44: compare individual values to aggregates * Go back to penguins src/compare_individual_aggregate.sql select body_mass_g from penguins where body_mass_g > ( select avg(body_mass_g) from penguins ) limit 5; out/compare_individual_aggregate.out | body_mass_g | |-------------| | 4675 | | 4250 | | 4400 | | 4500 | | 4650 | * Get average body mass in subquery * Compare each row against that * Requires two scans of the data, but there's no way to avoid that * Null values aren't included in the average or in the final results 45: compare individual values to aggregates within groups src/compare_within_groups.sql select penguins.species, penguins.body_mass_g, round(averaged.avg_mass_g, 1) as avg_mass_g from penguins join ( select species, avg(body_mass_g) as avg_mass_g from penguins group by species ) as averaged on penguins.species = averaged.species where penguins.body_mass_g > averaged.avg_mass_g limit 5; out/compare_within_groups.out | species | body_mass_g | avg_mass_g | |---------|-------------|------------| | Adelie | 3750 | 3700.7 | | Adelie | 3800 | 3700.7 | | Adelie | 4675 | 3700.7 | | Adelie | 4250 | 3700.7 | | Adelie | 3800 | 3700.7 | * Subquery runs first to create temporary table averaged with average mass per species * Join that with penguins * Filter to find penguins heavier than average within their species 46: common table expressions src/common_table_expressions.sql with grouped as ( select species, avg(body_mass_g) as avg_mass_g from penguins group by species ) select penguins.species, penguins.body_mass_g, round(grouped.avg_mass_g, 1) as avg_mass_g from penguins join grouped on penguins.species = grouped.species where penguins.body_mass_g > grouped.avg_mass_g limit 5; out/common_table_expressions.out | species | body_mass_g | avg_mass_g | |---------|-------------|------------| | Adelie | 3750 | 3700.7 | | Adelie | 3800 | 3700.7 | | Adelie | 4675 | 3700.7 | | Adelie | 4250 | 3700.7 | | Adelie | 3800 | 3700.7 | * Use common table expression (CTE) to make queries clearer + Nested subqueries quickly become difficult to understand * Database decides how to optimize explain query plan src/explain_query_plan.sql explain query plan select species, avg(body_mass_g) from penguins group by species; out/explain_query_plan.out QUERY PLAN |--SCAN penguins `--USE TEMP B-TREE FOR GROUP BY * SQLite plans to scan every row of the table * It will build a temporary B-tree data structure to group rows 47: enumerate rows * Every table has a special column called rowid src/rowid.sql select rowid, species, island from penguins limit 5; out/rowid.out | rowid | species | island | |-------|---------|-----------| | 1 | Adelie | Torgersen | | 2 | Adelie | Torgersen | | 3 | Adelie | Torgersen | | 4 | Adelie | Torgersen | | 5 | Adelie | Torgersen | * rowid is persistent within a session + I.e., if we delete the first 5 rows we now have row IDs 6...N * Do not rely on row ID + In particular, do not use it as a key 48: if-else function src/if_else.sql with sized_penguins as ( select species, iif( body_mass_g < 3500, 'small', 'large' ) as size from penguins ) select species, size, count(*) as num from sized_penguins group by species, size order by species, num; out/if_else.out | species | size | num | |-----------|-------|-----| | Adelie | small | 54 | | Adelie | large | 98 | | Chinstrap | small | 17 | | Chinstrap | large | 51 | | Gentoo | large | 124 | * iif(condition, true_result, false_result) + Note: iif with two i's 49: select a case * What if we want small, medium, and large?
* Can nest iif, but quickly becomes unreadable src/case_when.sql with sized_penguins as ( select species, case when body_mass_g < 3500 then 'small' when body_mass_g < 5000 then 'medium' else 'large' end as size from penguins ) select species, size, count(*) as num from sized_penguins group by species, size order by species, num; out/case_when.out | species | size | num | |-----------|--------|-----| | Adelie | large | 1 | | Adelie | small | 54 | | Adelie | medium | 97 | | Chinstrap | small | 17 | | Chinstrap | medium | 51 | | Gentoo | medium | 56 | | Gentoo | large | 68 | * Evaluate when options in order and take first * Result of case is null if no condition is true * Use else as fallback 50: check range src/check_range.sql with sized_penguins as ( select species, case when body_mass_g between 3500 and 5000 then 'normal' else 'abnormal' end as size from penguins ) select species, size, count(*) as num from sized_penguins group by species, size order by species, num; out/check_range.out | species | size | num | |-----------|----------|-----| | Adelie | abnormal | 55 | | Adelie | normal | 97 | | Chinstrap | abnormal | 17 | | Chinstrap | normal | 51 | | Gentoo | abnormal | 62 | | Gentoo | normal | 62 | * between can make queries easier to read * But be careful of the and in the middle yet another database * Entity-relationship diagram (ER diagram) shows relationships between tables * Like everything to do with databases, there are lots of variations assay database table diagram assay ER diagram src/assay_staff.sql select * from staff; out/assay_staff.out | ident | personal | family | dept | age | |-------|----------|-----------|------|-----| | 1 | Kartik | Gupta | | 46 | | 2 | Divit | Dhaliwal | hist | 34 | | 3 | Indrans | Sridhar | mb | 47 | | 4 | Pranay | Khanna | mb | 51 | | 5 | Riaan | Dua | | 23 | | 6 | Vedika | Rout | hist | 45 | | 7 | Abram | Chokshi | gen | 23 | | 8 | Romil | Kapoor | hist | 38 | | 9 | Ishaan | Ramaswamy | mb | 35 | | 10 | Nitya | Lal | gen | 52 | 51: pattern matching src/like_glob.sql select personal, family from staff where personal like '%ya%' or family glob '*De*'; out/like_glob.out | personal | family | |----------|--------| | Nitya | Lal | * like is the original SQL pattern matcher + % matches zero or more characters + Case insensitive by default * glob supports Unix-style wildcards | name | purpose | |------|---------| | substr | Get substring given starting point and length | | trim | Remove characters from beginning and end of string | | ltrim | Remove characters from beginning of string | | rtrim | Remove characters from end of string | | length | Length of string | | replace | Replace occurrences of substring with another string | | upper | Return upper-case version of string | | lower | Return lower-case version of string | | instr | Find location of first occurrence of substring (returns 0 if not found) | 52: select first and last rows src/union_all.sql select * from ( select * from (select * from experiment order by started asc limit 5) union all select * from (select * from experiment order by started desc limit 5) ) order by started asc ; out/union_all.out | ident | kind | started | ended | |-------|-------------|------------|------------| | 17 | trial | 2023-01-29 | 2023-01-30 | | 35 | calibration | 2023-01-30 | 2023-01-30 | | 36 | trial | 2023-02-02 | 2023-02-03 | | 25 | trial | 2023-02-12 | 2023-02-14 | | 2 | calibration | 2023-02-14 | 2023-02-14 | | 40 | calibration | 2024-01-21 | 2024-01-21 | | 12 | trial | 2024-01-26 | 2024-01-28 | | 44 | trial | 2024-01-27 | 2024-01-29 | | 34 | trial |
2024-02-01 | 2024-02-02 | | 14 | calibration | 2024-02-03 | 2024-02-03 | * union all combines records + Keeps duplicates: union on its own keeps unique records * Yes, it feels like the extra select * from should be unnecessary 53: intersection src/intersect.sql select personal, family, dept, age from staff where dept = 'mb' intersect select personal, family, dept, age from staff where age < 50 ; out/intersect.out | personal | family | dept | age | |----------|-----------|------|-----| | Indrans | Sridhar | mb | 47 | | Ishaan | Ramaswamy | mb | 35 | * Tables being intersected must have same structure * Intersection usually used when pulling values from different tables + In this case, would be clearer to use where 54: exclusion src/except.sql select personal, family, dept, age from staff where dept = 'mb' except select personal, family, dept, age from staff where age < 50 ; out/except.out | personal | family | dept | age | |----------|--------|------|-----| | Pranay | Khanna | mb | 51 | * Again, tables must have same structure + And this would be clearer with where * SQL operates on sets, not tables, except where it doesn't 55: random numbers and why not src/random_numbers.sql with decorated as ( select random() as rand, personal || ' ' || family as name from staff ) select rand, abs(rand) % 10 as selector, name from decorated where selector < 5; out/random_numbers.out | rand | selector | name | |----------------------|----------|-----------------| | 7176652035743196310 | 0 | Divit Dhaliwal | | -2243654635505630380 | 2 | Indrans Sridhar | | -6940074802089166303 | 5 | Pranay Khanna | | 8882650891091088193 | 9 | Riaan Dua | | -45079732302991538 | 5 | Vedika Rout | | -8973877087806386134 | 2 | Abram Chokshi | | 3360598450426870356 | 9 | Romil Kapoor | * There is no way to seed SQLite's random number generator * Which means there is no way to reproduce one of its "random" sequences 56: creating index src/create_use_index.sql explain query plan select filename from plate where filename like '%07%'; create index plate_file on plate(filename); explain query plan select filename from plate where filename like '%07%'; out/create_use_index.out QUERY PLAN `--SCAN plate USING COVERING INDEX sqlite_autoindex_plate_1 QUERY PLAN `--SCAN plate USING COVERING INDEX plate_file * An index is an auxiliary data structure that enables faster access to records + Spend storage space to buy speed * Don't have to mention it explicitly in queries + Database manager will use it automatically 57: generate sequence src/generate_sequence.sql select value from generate_series(1, 5); out/generate_sequence.out | value | |-------| | 1 | | 2 | | 3 | | 4 | | 5 | * A (non-standard) table-valued function 58: generate sequence based on data src/data_range_sequence.sql create table temp ( num integer not null ); insert into temp values (1), (5); select value from generate_series ( (select min(num) from temp), (select max(num) from temp) ); out/data_range_sequence.out | value | |-------| | 1 | | 2 | | 3 | | 4 | | 5 | * Must have the parentheses around the min and max selections to keep SQLite happy 59: generate sequence of dates src/date_sequence.sql select date((select julianday(min(started)) from experiment) + value) as some_day from ( select value from generate_series( (select 0), (select count(*) - 1 from experiment) ) ) limit 5; out/date_sequence.out | some_day | |------------| | 2023-01-29 | | 2023-01-30 | | 2023-01-31 | | 2023-02-01 | | 2023-02-02 | * SQLite represents dates as YYYY-MM-DD strings or as Julian days or as Unix 
milliseconds or... + Julian days is fractional number of days since November 24, 4714 BCE * julianday and date convert back and forth 60: count experiments started per day without gaps src/experiments_per_day.sql with -- complete sequence of days with 0 as placeholder for number of experiments all_days as ( select date((select julianday(min(started)) from experiment) + value) as some_day, 0 as zeroes from ( select value from generate_series( (select 0), (select count(*) - 1 from experiment) ) ) ), -- sequence of actual days with actual number of experiments started actual_days as ( select started, count(started) as num_exp from experiment group by started ) -- combined by joining on day and taking actual number (if available) or zero select all_days.some_day as day, coalesce(actual_days.num_exp, all_days.zeroes) as num_exp from all_days left join actual_days on all_days.some_day = actual_days.started limit 5 ; out/experiments_per_day.out | day | num_exp | |------------|---------| | 2023-01-29 | 1 | | 2023-01-30 | 1 | | 2023-01-31 | 0 | | 2023-02-01 | 0 | | 2023-02-02 | 1 | 61: self join src/self_join.sql with person as ( select ident, personal || ' ' || family as name from staff ) select left.name, right.name from person as left join person as right limit 10; out/self_join.out | name | name | |--------------|------------------| | Kartik Gupta | Kartik Gupta | | Kartik Gupta | Divit Dhaliwal | | Kartik Gupta | Indrans Sridhar | | Kartik Gupta | Pranay Khanna | | Kartik Gupta | Riaan Dua | | Kartik Gupta | Vedika Rout | | Kartik Gupta | Abram Chokshi | | Kartik Gupta | Romil Kapoor | | Kartik Gupta | Ishaan Ramaswamy | | Kartik Gupta | Nitya Lal | * Join a table to itself + Use as to create aliases for copies of tables to distinguish them + Nothing special about the names left and right * Get all n^2 pairs, including person with themself 62: generate unique pairs src/unique_pairs.sql with person as ( select ident, personal || ' ' || family as name from staff ) select left.name, right.name from person as left join person as right on left.ident < right.ident where left.ident <= 4 and right.ident <= 4; out/unique_pairs.out | name | name | |-----------------|-----------------| | Kartik Gupta | Divit Dhaliwal | | Kartik Gupta | Indrans Sridhar | | Kartik Gupta | Pranay Khanna | | Divit Dhaliwal | Indrans Sridhar | | Divit Dhaliwal | Pranay Khanna | | Indrans Sridhar | Pranay Khanna | * left.ident < right.ident ensures distinct pairs without duplicates * Use left.ident <= 4 and right.ident <= 4 to limit output * Quick check: n(n-1)/2 pairs 63: filter pairs src/filter_pairs.sql with person as ( select ident, personal || ' ' || family as name from staff ), together as ( select left.staff as left_staff, right.staff as right_staff from performed as left join performed as right on left.experiment = right.experiment where left_staff < right_staff ) select left.name as person_1, right.name as person_2 from person as left join person as right join together on left.ident = left_staff and right.ident = right_staff; out/filter_pairs.out | person_1 | person_2 | |-----------------|------------------| | Kartik Gupta | Vedika Rout | | Pranay Khanna | Vedika Rout | | Indrans Sridhar | Romil Kapoor | | Abram Chokshi | Ishaan Ramaswamy | | Pranay Khanna | Vedika Rout | | Kartik Gupta | Abram Chokshi | | Abram Chokshi | Romil Kapoor | | Kartik Gupta | Divit Dhaliwal | | Divit Dhaliwal | Abram Chokshi | | Pranay Khanna | Ishaan Ramaswamy | | Indrans Sridhar | Romil Kapoor | | Kartik Gupta | Ishaan Ramaswamy | | 
Kartik Gupta | Nitya Lal | | Kartik Gupta | Abram Chokshi | | Pranay Khanna | Romil Kapoor | 64: existence and correlated subqueries src/correlated_subquery.sql select name, building from department where exists ( select 1 from staff where dept = department.ident ) order by name; out/correlated_subquery.out | name | building | |-------------------|------------------| | Genetics | Chesson | | Histology | Fashet Extension | | Molecular Biology | Chesson | * Nobody works in Endocrinology * select 1 could equally be select true or any other value * A correlated subquery depends on a value from the outer query + Equivalent to nested loop 65: nonexistence src/nonexistence.sql select name, building from department where not exists ( select 1 from staff where dept = department.ident ) order by name; out/nonexistence.out | name | building | |---------------|----------| | Endocrinology | TGVH | avoiding correlated subqueries src/avoid_correlated_subqueries.sql select distinct department.name as name, department.building as building from department join staff on department.ident = staff.dept order by name; out/avoid_correlated_subqueries.out | name | building | |-------------------|------------------| | Genetics | Chesson | | Histology | Fashet Extension | | Molecular Biology | Chesson | * The join might or might not be faster than the correlated subquery * Hard to find unstaffed departments without either not exists or count and a check for 0 66: lead and lag src/lead_lag.sql with ym_num as ( select strftime('%Y-%m', started) as ym, count(*) as num from experiment group by ym ) select ym, lag(num) over (order by ym) as prev_num, num, lead(num) over (order by ym) as next_num from ym_num order by ym; out/lead_lag.out | ym | prev_num | num | next_num | |---------|----------|-----|----------| | 2023-01 | | 2 | 5 | | 2023-02 | 2 | 5 | 5 | | 2023-03 | 5 | 5 | 1 | | 2023-04 | 5 | 1 | 6 | | 2023-05 | 1 | 6 | 5 | | 2023-06 | 6 | 5 | 3 | | 2023-07 | 5 | 3 | 2 | | 2023-08 | 3 | 2 | 4 | | 2023-09 | 2 | 4 | 6 | | 2023-10 | 4 | 6 | 4 | | 2023-12 | 6 | 4 | 5 | | 2024-01 | 4 | 5 | 2 | | 2024-02 | 5 | 2 | | * Use strftime to extract year and month + Clumsy, but date/time handling is not SQLite's strong point * Use window functions lead and lag to shift values + Unavailable values are null 67: window functions src/window_functions.sql with ym_num as ( select strftime('%Y-%m', started) as ym, count(*) as num from experiment group by ym ) select ym, num, sum(num) over (order by ym) as num_done, cume_dist() over (order by ym) as progress from ym_num order by ym; out/window_functions.out | ym | num | num_done | progress | |---------|-----|----------|--------------------| | 2023-01 | 2 | 2 | 0.0769230769230769 | | 2023-02 | 5 | 7 | 0.153846153846154 | | 2023-03 | 5 | 12 | 0.230769230769231 | | 2023-04 | 1 | 13 | 0.307692307692308 | | 2023-05 | 6 | 19 | 0.384615384615385 | | 2023-06 | 5 | 24 | 0.461538461538462 | | 2023-07 | 3 | 27 | 0.538461538461538 | | 2023-08 | 2 | 29 | 0.615384615384615 | | 2023-09 | 4 | 33 | 0.692307692307692 | | 2023-10 | 6 | 39 | 0.769230769230769 | | 2023-12 | 4 | 43 | 0.846153846153846 | | 2024-01 | 5 | 48 | 0.923076923076923 | | 2024-02 | 2 | 50 | 1.0 | * sum() over does a running total * cume_dist is fraction of rows seen so far explain another query plan src/explain_window_function.sql explain query plan with ym_num as ( select strftime('%Y-%m', started) as ym, count(*) as num from experiment group by ym ) select ym, num, sum(num) over (order by ym) as num_done, cume_dist() over (order by ym)
as progress from ym_num order by ym; out/explain_window_function.out QUERY PLAN |--CO-ROUTINE (subquery-3) | |--CO-ROUTINE (subquery-4) | | |--CO-ROUTINE ym_num | | | |--SCAN experiment | | | `--USE TEMP B-TREE FOR GROUP BY | | |--SCAN ym_num | | `--USE TEMP B-TREE FOR ORDER BY | `--SCAN (subquery-4) `--SCAN (subquery-3) * Becomes useful...eventually 68: partitioned windows src/partition_window.sql with y_m_num as ( select strftime('%Y', started) as year, strftime('%m', started) as month, count(*) as num from experiment group by year, month ) select year, month, num, sum(num) over (partition by year order by month) as num_done from y_m_num order by year, month; out/partition_window.out | year | month | num | num_done | |------|-------|-----|----------| | 2023 | 01 | 2 | 2 | | 2023 | 02 | 5 | 7 | | 2023 | 03 | 5 | 12 | | 2023 | 04 | 1 | 13 | | 2023 | 05 | 6 | 19 | | 2023 | 06 | 5 | 24 | | 2023 | 07 | 3 | 27 | | 2023 | 08 | 2 | 29 | | 2023 | 09 | 4 | 33 | | 2023 | 10 | 6 | 39 | | 2023 | 12 | 4 | 43 | | 2024 | 01 | 5 | 5 | | 2024 | 02 | 2 | 7 | * partition by creates groups * So this counts experiments started since the beginning of each year 69: blobs src/blob.sql create table images ( name text not null, content blob ); insert into images(name, content) values ("biohazard", readfile("img/biohazard.png")), ("crush", readfile("img/crush.png")), ("fire", readfile("img/fire.png")), ("radioactive", readfile("img/radioactive.png")), ("tripping", readfile("img/tripping.png")); out/blob.out | name | length(content) | |-------------|-----------------| | biohazard | 19629 | | crush | 15967 | | fire | 18699 | | radioactive | 16661 | | tripping | 17208 | * A blob is a binary large object + Bytes in, bytes out... * If you think that's odd, check out Fossil yet another database src/lab_log_db.sh sqlite3 db/lab_log.db src/lab_log_schema.sql .schema out/lab_log_schema.out CREATE TABLE sqlite_sequence(name,seq); CREATE TABLE person( ident integer primary key autoincrement, details text not null ); CREATE TABLE machine( ident integer primary key autoincrement, name text not null, details text not null ); CREATE TABLE usage( ident integer primary key autoincrement, log text not null ); 70: store JSON src/json_in_table.sql select * from machine; out/json_in_table.out | ident | name | details | |-------|----------------|---------------------------------------------------------| | 1 | WY401 | {"acquired": "2023-05-01"} | | 2 | Inphormex | {"acquired": "2021-07-15", "refurbished": "2023-10-22"} | | 3 | AutoPlate 9000 | {"note": "needs software update"} | * Store heterogeneous data as JSON-formatted text (with double-quoted strings) + Database parses it each time it is queried * Alternatively store as blob + Can't just view it + But more efficient 71: select field from JSON src/json_field.sql select details->'$.acquired' as single_arrow, details->>'$.acquired' as double_arrow from machine; out/json_field.out | single_arrow | double_arrow | |--------------|--------------| | "2023-05-01" | 2023-05-01 | | "2021-07-15" | 2021-07-15 | | | | * Single arrow -> returns a JSON representation of the result * Double arrow ->> returns SQL text, integer, real, or null * Left side is column * Right side is path expression + Start with $ (meaning "root") + Fields separated by .
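One detail worth knowing: a path that names a missing field returns null rather than raising an error, so queries over heterogeneous JSON degrade gracefully. A sketch against the machine table above:

```sql
-- Only Inphormex has a refurbished field; the other rows yield null.
select name, details->>'$.refurbished' as refurbished
from machine;
```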
72: JSON array access src/json_array.sql select ident, json_array_length(log->'$') as length, log->'$[0]' as first from usage; out/json_array.out | ident | length | first | |-------|--------|--------------------------------------------------------------| | 1 | 4 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} | | 2 | 5 | {"machine":"Inphormex","person":["Marianne","Richer"]} | | 3 | 2 | {"machine":"sterilizer","person":["Josette","Villeneuve"]} | | 4 | 1 | {"machine":"sterilizer","person":["Maude","Goulet"]} | | 5 | 2 | {"machine":"AutoPlate 9000","person":["Brigitte","Michaud"]} | | 6 | 1 | {"machine":"sterilizer","person":["Marianne","Richer"]} | | 7 | 3 | {"machine":"WY401","person":["Maude","Goulet"]} | | 8 | 1 | {"machine":"AutoPlate 9000"} | * SQLite (and other database managers) has lots of JSON manipulation functions * json_array_length gives number of elements in selected array * subscripts start with 0 * Characters outside 7-bit ASCII represented as Unicode escapes 73: unpack JSON array src/json_unpack.sql select ident, json_each.key as key, json_each.value as value from usage, json_each(usage.log) limit 10; out/json_unpack.out | ident | key | value | |-------|-----|--------------------------------------------------------------| | 1 | 0 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} | | 1 | 1 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} | | 1 | 2 | {"machine":"WY401","person":["Gabrielle","Dub\u00e9"]} | | 1 | 3 | {"machine":"Inphormex","person":["Gabrielle","Dub\u00e9"]} | | 2 | 0 | {"machine":"Inphormex","person":["Marianne","Richer"]} | | 2 | 1 | {"machine":"AutoPlate 9000","person":["Marianne","Richer"]} | | 2 | 2 | {"machine":"sterilizer","person":["Marianne","Richer"]} | | 2 | 3 | {"machine":"AutoPlate 9000","person":["Monique","Marcotte"]} | | 2 | 4 | {"machine":"sterilizer","person":["Marianne","Richer"]} | | 3 | 0 | {"machine":"sterilizer","person":["Josette","Villeneuve"]} | * json_each is another table-valued function * Use json_each.name to get properties of unpacked array 74: last element of array src/json_array_last.sql select ident, log->'$[#-1].machine' as final from usage limit 5; out/json_array_last.out | ident | final | |-------|--------------| | 1 | "Inphormex" | | 2 | "sterilizer" | | 3 | "Inphormex" | | 4 | "sterilizer" | | 5 | "sterilizer" | 75: modify JSON src/json_modify.sql select ident, name, json_set(details, '$.sold', json_quote('2024-01-25')) as updated from machine; out/json_modify.out | ident | name | updated | |-------|----------------|--------------------------------------------------------------| | 1 | WY401 | {"acquired":"2023-05-01","sold":"2024-01-25"} | | 2 | Inphormex | {"acquired":"2021-07-15","refurbished":"2023-10-22","sold":" | | | | 2024-01-25"} | | 3 | AutoPlate 9000 | {"note":"needs software update","sold":"2024-01-25"} | * Updates the in-memory copy of the JSON, not the database record * Please use json_quote rather than trying to format JSON with string operations refresh penguins src/count_penguins.sql select species, count(*) as num from penguins group by species; out/count_penguins.out | species | num | |-----------|-----| | Adelie | 152 | | Chinstrap | 68 | | Gentoo | 124 | * We will restore full database after each example 76: tombstones src/make_active.sql alter table penguins add active integer not null default 1; update penguins set active = iif(species = 'Adelie', 0, 1); src/active_penguins.sql select species, count(*) as num from penguins where active group by species; 
out/active_penguins.out | species | num | |-----------|-----| | Chinstrap | 68 | | Gentoo | 124 | * Use a tombstone to mark (in)active records * Every query must now include it 77: views src/views.sql create view if not exists active_penguins ( species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex ) as select species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex from penguins where active; select species, count(*) as num from active_penguins group by species; out/views.out | species | num | |-----------|-----| | Chinstrap | 68 | | Gentoo | 124 | * A view is a saved query that other queries can invoke * View is re-run each time it's used * Like a CTE, but: + Can be shared between queries + Views came first * Some databases offer materialized views + Update-on-demand temporary tables hours reminder src/all_jobs.sql create table job ( name text not null, billable real not null ); insert into job values ('calibrate', 1.5), ('clean', 0.5); select * from job; out/all_jobs.out | name | billable | |-----------|----------| | calibrate | 1.5 | | clean | 0.5 | 78: add check src/all_jobs_check.sql create table job ( name text not null, billable real not null, check (billable > 0.0) ); insert into job values ('calibrate', 1.5); insert into job values ('reset', -0.5); select * from job; out/all_jobs_check.out Runtime error near line 9: CHECK constraint failed: billable > 0.0 (19) | name | billable | |-----------|----------| | calibrate | 1.5 | * check adds constraint to table + Must produce a Boolean result + Run each time values added or modified * But changes made before the error have taken effect ACID * Atomic: change cannot be broken down into smaller ones (i.e., all or nothing) * Consistent: database goes from one consistent state to another * Isolated: looks like changes happened one after another * Durable: if change takes place, it's still there after a restart 79: transactions src/transaction.sql create table job ( name text not null, billable real not null, check (billable > 0.0) ); insert into job values ('calibrate', 1.5); begin transaction; insert into job values ('clean', 0.5); rollback; select * from job; out/transaction.out | name | billable | |-----------|----------| | calibrate | 1.5 | * Statements outside transaction execute and are committed immediately * Statement(s) inside transaction don't take effect until: + end transaction (success) + rollback (undo) * Can have any number of statements inside a transaction * But cannot nest transactions in SQLite + Other databases support this 80: rollback in constraint src/rollback_constraint.sql create table job ( name text not null, billable real not null, check (billable > 0.0) on conflict rollback ); insert into job values ('calibrate', 1.5); insert into job values ('clean', 0.5), ('reset', -0.5); select * from job; out/rollback_constraint.out Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19) | name | billable | |-----------|----------| | calibrate | 1.5 | * All of second insert rolled back as soon as error occurred * But first insert took effect 81: rollback in statement src/rollback_statement.sql create table job ( name text not null, billable real not null, check (billable > 0.0) ); insert or rollback into job values ('calibrate', 1.5); insert or rollback into job values ('clean', 0.5), ('reset', -0.5); select * from job; out/rollback_statement.out Runtime error near line 11: CHECK constraint failed: billable > 0.0 (19) | name | billable | 
|-----------|----------| | calibrate | 1.5 | * Constraint is in table definition * Action is in statement 82: upsert src/upsert.sql create table jobs_done ( person text unique, num integer default 0 ); insert into jobs_done values("zia", 1); .print "after first" select * from jobs_done; .print insert into jobs_done values("zia", 1); .print "after failed" select * from jobs_done; insert into jobs_done values("zia", 1) on conflict(person) do update set num = num + 1; .print "\nafter upsert" select * from jobs_done; out/upsert.out after first | person | num | |--------|-----| | zia | 1 | Runtime error near line 14: UNIQUE constraint failed: jobs_done.person (19) after failed | person | num | |--------|-----| | zia | 1 | after upsert | person | num | |--------|-----| | zia | 2 | * upsert stands for "update or insert" + Create if record doesn't exist + Update if it does * Not standard SQL but widely implemented * Example also shows use of SQLite .print command normalization * First normal form (1NF): every field of every record contains one indivisible value. * Second normal form (2NF) and third normal form (3NF): every value in a record that isn't a key depends solely on the key, not on other values. * Denormalization: explicitly store values that could be calculated on the fly + To simplify queries and/or make processing faster 83: create trigger * A trigger automatically runs before or after a specified operation * Can have side effects (e.g., update some other table) * And/or implement checks (e.g., make sure other records exist) * Add processing overhead... * ...but data is either cheap or correct, never both * Inside trigger, refer to old and new versions of record as old.column and new.column src/trigger_setup.sql -- Track hours of lab work. create table job ( person text not null, reported real not null check (reported >= 0.0) ); -- Explicitly store per-person total rather than using sum(). create table total ( person text unique not null, hours real ); -- Initialize totals. insert into total values ("gene", 0.0), ("august", 0.0); -- Define a trigger. create trigger total_trigger before insert on job begin -- Check that the person exists. select case when not exists (select 1 from total where person = new.person) then raise(rollback, 'Unknown person ') end; -- Update their total hours (or fail if non-negative constraint violated).
update total set hours = hours + new.reported where total.person = new.person; end; src/trigger_successful.sql insert into job values ('gene', 1.5), ('august', 0.5), ('gene', 1.0) ; out/trigger_successful.out | person | reported | |--------|----------| | gene | 1.5 | | august | 0.5 | | gene | 1.0 | | person | hours | |--------|-------| | gene | 2.5 | | august | 0.5 | trigger firing src/trigger_firing.sql insert into job values ('gene', 1.0), ('august', -1.0) ; out/trigger_firing.out Runtime error near line 6: CHECK constraint failed: reported >= 0.0 (19) | person | hours | |--------|-------| | gene | 0.0 | | august | 0.0 | represent graphs src/lineage_setup.sql create table lineage ( parent text not null, child text not null ); insert into lineage values ('Arturo', 'Clemente'), ('Dario', 'Clemente'), ('Clemente', 'Homero'), ('Clemente', 'Ivonne'), ('Ivonne', 'Lourdes'), ('Soledad', 'Lourdes'), ('Lourdes', 'Santiago'); src/represent_graph.sql select * from lineage; out/represent_graph.out | parent | child | |----------|----------| | Arturo | Clemente | | Dario | Clemente | | Clemente | Homero | | Clemente | Ivonne | | Ivonne | Lourdes | | Soledad | Lourdes | | Lourdes | Santiago | lineage diagram 84: recursive query src/recursive_lineage.sql with recursive descendent as ( select 'Clemente' as person, 0 as generations union all select lineage.child as person, descendent.generations + 1 as generations from descendent join lineage on descendent.person = lineage.parent ) select person, generations from descendent; out/recursive_lineage.out | person | generations | |----------|-------------| | Clemente | 0 | | Homero | 1 | | Ivonne | 1 | | Lourdes | 2 | | Santiago | 3 | * Use a recursive CTE to create a temporary table (descendent) * Base case seeds this table * Recursive case relies on value(s) already in that table and external table(s) * union all to combine rows + Can use union but that has lower performance (must check uniqueness each time) * Stops when the recursive case yields an empty row set (nothing new to add) * Then select the desired values from the CTE contact tracing database src/contact_person.sql select * from person; out/contact_person.out | ident | name | |-------|-----------------------| | 1 | Juana Baeza | | 2 | Agustin Rodriquez | | 3 | Ariadna Caraballo | | 4 | Micaela Laboy | | 5 | Veronica Altamirano | | 6 | Reina Rivero | | 7 | Elias Merino | | 8 | Minerva Guerrero | | 9 | Mauro Balderas | | 10 | Pilar Alarcon | | 11 | Daniela Menendez | | 12 | Marco Antonio Barrera | | 13 | Cristal Soliz | | 14 | Bernardo Narvaez | | 15 | Oscar Barrios | src/contact_contacts.sql select * from contact; out/contact_contacts.out | left | right | |-------------------|-----------------------| | Agustin Rodriquez | Ariadna Caraballo | | Agustin Rodriquez | Veronica Altamirano | | Juana Baeza | Veronica Altamirano | | Juana Baeza | Micaela Laboy | | Pilar Alarcon | Reina Rivero | | Cristal Soliz | Marco Antonio Barrera | | Cristal Soliz | Daniela Menendez | | Daniela Menendez | Marco Antonio Barrera | contact diagram 85: bidirectional contacts src/bidirectional.sql create temporary table bi_contact ( left text, right text ); insert into bi_contact select left, right from contact union all select right, left from contact ; out/bidirectional.out | original_count | |----------------| | 8 | | num_contact | |-------------| | 16 | * Create a temporary table rather than using a long chain of CTEs + Only lasts as long as the session (not saved to disk) * Duplicate information rather than writing
86: update group identifiers
src/update_group_ids.sql
select
    left.name as left_name,
    left.ident as left_ident,
    right.name as right_name,
    right.ident as right_ident,
    min(left.ident, right.ident) as new_ident
from (person as left join bi_contact on left.name = bi_contact.left)
join person as right on bi_contact.right = right.name;
out/update_group_ids.out
| left_name | left_ident | right_name | right_ident | new_ident |
|-----------------------|------------|-----------------------|-------------|-----------|
| Juana Baeza | 1 | Micaela Laboy | 4 | 1 |
| Juana Baeza | 1 | Veronica Altamirano | 5 | 1 |
| Agustin Rodriquez | 2 | Ariadna Caraballo | 3 | 2 |
| Agustin Rodriquez | 2 | Veronica Altamirano | 5 | 2 |
| Ariadna Caraballo | 3 | Agustin Rodriquez | 2 | 2 |
| Micaela Laboy | 4 | Juana Baeza | 1 | 1 |
| Veronica Altamirano | 5 | Agustin Rodriquez | 2 | 2 |
| Veronica Altamirano | 5 | Juana Baeza | 1 | 1 |
| Reina Rivero | 6 | Pilar Alarcon | 10 | 6 |
| Pilar Alarcon | 10 | Reina Rivero | 6 | 6 |
| Daniela Menendez | 11 | Cristal Soliz | 13 | 11 |
| Daniela Menendez | 11 | Marco Antonio Barrera | 12 | 11 |
| Marco Antonio Barrera | 12 | Cristal Soliz | 13 | 12 |
| Marco Antonio Barrera | 12 | Daniela Menendez | 11 | 11 |
| Cristal Soliz | 13 | Daniela Menendez | 11 | 11 |
| Cristal Soliz | 13 | Marco Antonio Barrera | 12 | 12 |
* new_ident is minimum of own identifier and identifiers one step away
* Doesn't keep people with no contacts
87: recursive labeling
src/recursive_labeling.sql
with recursive labeled as (
    select person.name as name, person.ident as label
    from person
    union -- not 'union all'
    select person.name as name, labeled.label as label
    from (person join bi_contact on person.name = bi_contact.left)
    join labeled on bi_contact.right = labeled.name
    where labeled.label < person.ident
)
select name, min(label) as group_id
from labeled
group by name
order by label, name;
out/recursive_labeling.out
| name | group_id |
|-----------------------|----------|
| Agustin Rodriquez | 1 |
| Ariadna Caraballo | 1 |
| Juana Baeza | 1 |
| Micaela Laboy | 1 |
| Veronica Altamirano | 1 |
| Pilar Alarcon | 6 |
| Reina Rivero | 6 |
| Elias Merino | 7 |
| Minerva Guerrero | 8 |
| Mauro Balderas | 9 |
| Cristal Soliz | 11 |
| Daniela Menendez | 11 |
| Marco Antonio Barrera | 11 |
| Bernardo Narvaez | 14 |
| Oscar Barrios | 15 |
* Use union instead of union all to prevent infinite recursion
88: query from Python
src/basic_python_query.py
import sqlite3

connection = sqlite3.connect("db/penguins.db")
cursor = connection.execute("select count(*) from penguins;")
rows = cursor.fetchall()
print(rows)
out/basic_python_query.out
[(344,)]
* sqlite3 is part of Python's standard library
* Create a connection to a database file
* Get a cursor by executing a query
+ More common to create cursor and use that to run queries
* Fetch all rows at once as list of tuples
89: incremental fetch
src/incremental_fetch.py
import sqlite3

connection = sqlite3.connect("db/penguins.db")
cursor = connection.cursor()
cursor = cursor.execute("select species, island from penguins limit 5;")
while row := cursor.fetchone():
    print(row)
out/incremental_fetch.out
('Adelie', 'Torgersen')
('Adelie', 'Torgersen')
('Adelie', 'Torgersen')
('Adelie', 'Torgersen')
('Adelie', 'Torgersen')
* cursor.fetchone returns None when no more data
* There is also fetchmany(N) to fetch (up to) a certain number of rows (see the sketch below)
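* A minimal sketch of fetchmany, assuming the same db/penguins.db as above (this is a hypothetical illustration, not one of the tutorial's numbered examples):

import sqlite3

connection = sqlite3.connect("db/penguins.db")
cursor = connection.execute("select species, island from penguins;")
# fetchmany returns an empty list once the results are exhausted,
# so the loop below stops cleanly
while batch := cursor.fetchmany(100):
    print(len(batch), "rows in this batch")

* Since the table has 344 rows, this prints batch sizes of 100, 100, 100, and 44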
90: insert, delete, and all that
src/insert_delete.py
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table example(num integer);")
cursor.execute("insert into example values (10), (20);")
print("after insertion", cursor.execute("select * from example;").fetchall())
cursor.execute("delete from example where num < 15;")
print("after deletion", cursor.execute("select * from example;").fetchall())
out/insert_delete.out
after insertion [(10,), (20,)]
after deletion [(20,)]
* sqlite3 automatically wraps data-changing statements in a transaction; call connection.commit() to save changes permanently (not needed here because the database is in memory and all queries run on the same connection)
91: interpolate values
src/interpolate.py
import sqlite3

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("create table example(num integer);")
cursor.executemany("insert into example values (?);", [(10,), (20,)])
print("after insertion", cursor.execute("select * from example;").fetchall())
out/interpolate.out
after insertion [(10,), (20,)]
* Interpolate values with ? placeholders rather than string formatting to avoid SQL injection
* From XKCD's Exploits of a Mom
92: script execution
src/script_execution.py
import sqlite3

SETUP = """\
drop table if exists example;
create table example(num integer);
insert into example values (10), (20);
"""

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.executescript(SETUP)
print("after insertion", cursor.execute("select * from example;").fetchall())
out/script_execution.out
after insertion [(10,), (20,)]
* But what if something goes wrong?
93: SQLite exceptions in Python
src/exceptions.py
import sqlite3

SETUP = """\
create table example(num integer check(num > 0));
insert into example values (10);
insert into example values (-1);
insert into example values (20);
"""

connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
try:
    cursor.executescript(SETUP)
except sqlite3.Error as exc:
    print(f"SQLite exception: {exc}")
print("after execution", cursor.execute("select * from example;").fetchall())
out/exceptions.out
SQLite exception: CHECK constraint failed: num > 0
after execution [(10,)]
94: Python in SQLite
src/embedded_python.py
import sqlite3

SETUP = """\
create table example(num integer);
insert into example values (-10), (10), (20), (30);
"""

def clip(value):
    if value < 0:
        return 0
    if value > 20:
        return 20
    return value

connection = sqlite3.connect(":memory:")
connection.create_function("clip", 1, clip)
cursor = connection.cursor()
cursor.executescript(SETUP)
for row in cursor.execute("select num, clip(num) from example;").fetchall():
    print(row)
out/embedded_python.out
(-10, 0)
(10, 10)
(20, 20)
(30, 20)
* SQLite calls back into Python to execute the function
* Other databases can run Python (and other languages) in the database server process
* Be careful
* sqlite3 can also register custom aggregate functions (see the sketch below)
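* connection.create_function registers a scalar function; the standard library also provides connection.create_aggregate for custom aggregations. A minimal sketch, using a hypothetical spread aggregate (max minus min) that is not part of the tutorial's examples:

import sqlite3

class Spread:
    # accumulate the minimum and maximum of the values seen so far
    def __init__(self):
        self.lo = None
        self.hi = None

    def step(self, value):
        if value is not None:
            self.lo = value if self.lo is None else min(self.lo, value)
            self.hi = value if self.hi is None else max(self.hi, value)

    def finalize(self):
        return None if self.lo is None else self.hi - self.lo

connection = sqlite3.connect(":memory:")
connection.create_aggregate("spread", 1, Spread)
cursor = connection.cursor()
cursor.executescript("""
create table example(num integer);
insert into example values (-10), (10), (20), (30);
""")
print(cursor.execute("select spread(num) from example;").fetchone())

* SQLite makes one instance of the class per aggregation, calls step once per row, and calls finalize to get the result, so this prints (40,)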
95: handle dates and times
src/dates_times.py
from datetime import date
import sqlite3

# Convert date to ISO-formatted string when writing to database
def _adapt_date_iso(val):
    return val.isoformat()

sqlite3.register_adapter(date, _adapt_date_iso)

# Convert ISO-formatted string to date when reading from database
def _convert_date(val):
    return date.fromisoformat(val.decode())

sqlite3.register_converter("date", _convert_date)

SETUP = """\
create table events(
    happened date not null,
    description text not null
);
"""

connection = sqlite3.connect(":memory:", detect_types=sqlite3.PARSE_DECLTYPES)
cursor = connection.cursor()
cursor.execute(SETUP)
cursor.executemany(
    "insert into events values (?, ?);",
    [(date(2024, 1, 10), "started tutorial"), (date(2024, 1, 29), "finished tutorial")],
)
for row in cursor.execute("select * from events;").fetchall():
    print(row)
out/dates_times.out
(datetime.date(2024, 1, 10), 'started tutorial')
(datetime.date(2024, 1, 29), 'finished tutorial')
* sqlite3.PARSE_DECLTYPES tells the sqlite3 library to use converters based on declared column types
* Adapt on the way in, convert on the way out
96: SQL in Jupyter notebooks
src/install_jupysql.sh
pip install jupysql
* And then inside the notebook:
src/load_ext.text
%load_ext sql
* Loads extension
src/jupyter_connect.text
%sql sqlite:///data/penguins.db
out/jupyter_connect.out
Connecting to 'sqlite:///data/penguins.db'
* Connects to database
+ sqlite:// with two slashes is the protocol
+ /data/penguins.db (one leading slash) is a local path
* Single percent sign %sql introduces one-line command
* Use double percent sign %%sql to indicate that the rest of the cell is SQL
src/jupyter_select.text
%%sql
select species, count(*) as num from penguins group by species;
out/jupyter_select.out
Running query in 'sqlite:///data/penguins.db'
species num
Adelie 152
Chinstrap 68
Gentoo 124
97: Pandas and SQL
src/install_pandas.sh
pip install pandas
src/select_pandas.py
import pandas as pd
import sqlite3

connection = sqlite3.connect("db/penguins.db")
query = "select species, count(*) as num from penguins group by species;"
df = pd.read_sql(query, connection)
print(df)
out/select_pandas.out
species num
0 Adelie 152
1 Chinstrap 68
2 Gentoo 124
* Be careful about datatype conversion (see the sketch below)
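* As a quick check on that warning, a hypothetical sketch (not one of the tutorial's examples) that prints the column types pandas infers, assuming the same db/penguins.db:

import pandas as pd
import sqlite3

connection = sqlite3.connect("db/penguins.db")
df = pd.read_sql("select species, body_mass_g from penguins;", connection)
# a column that contains nulls may come back as float64 with NaN,
# not as an integer column with null
print(df.dtypes)

* In general, pandas has no integer-with-missing-values column type by default, so SQL nulls can silently change a column's type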
98: Polars and SQL
src/install_polars.sh
pip install polars pyarrow adbc-driver-sqlite
src/select_polars.py
import polars as pl

query = "select species, count(*) as num from penguins group by species;"
uri = "sqlite:///db/penguins.db"
df = pl.read_database_uri(query, uri, engine="adbc")
print(df)
out/select_polars.out
shape: (3, 2)
+-----------+-----+
| species | num |
| --- | --- |
| str | i64 |
+-----------+-----+
| Adelie | 152 |
| Chinstrap | 68 |
| Gentoo | 124 |
+-----------+-----+
* The Uniform Resource Identifier (URI) specifies the database
* The query is the query
* Use the ADBC engine instead of the default ConnectorX
99: object-relational mapper
src/orm.py
from sqlmodel import Field, Session, SQLModel, create_engine, select

class Department(SQLModel, table=True):
    ident: str = Field(default=None, primary_key=True)
    name: str
    building: str

engine = create_engine("sqlite:///db/assays.db")
with Session(engine) as session:
    statement = select(Department)
    for result in session.exec(statement).all():
        print(result)
out/orm.out
building='Chesson' name='Genetics' ident='gen'
building='Fashet Extension' name='Histology' ident='hist'
building='Chesson' name='Molecular Biology' ident='mb'
building='TGVH' name='Endocrinology' ident='end'
* An object-relational mapper (ORM) translates table columns to object properties and vice versa
* SQLModel relies on Python type hints
100: relations with ORM
src/orm_relation.py
class Staff(SQLModel, table=True):
    ident: str = Field(default=None, primary_key=True)
    personal: str
    family: str
    dept: Optional[str] = Field(default=None, foreign_key="department.ident")
    age: int

engine = create_engine("sqlite:///db/assays.db")
SQLModel.metadata.create_all(engine)
with Session(engine) as session:
    statement = select(Department, Staff).where(Staff.dept == Department.ident)
    for dept, staff in session.exec(statement):
        print(f"{dept.name}: {staff.personal} {staff.family}")
out/orm_relation.out
Histology: Divit Dhaliwal
Molecular Biology: Indrans Sridhar
Molecular Biology: Pranay Khanna
Histology: Vedika Rout
Genetics: Abram Chokshi
Histology: Romil Kapoor
Molecular Biology: Ishaan Ramaswamy
Genetics: Nitya Lal
* Make foreign keys explicit in class definitions
* SQLModel automatically does the join
+ The two staff with no department aren't included in the result
Appendices
Terms
1-to-1 relation: A relationship between two tables in which each record from the first table matches exactly one record from the second and vice versa.
1-to-many relation: A relationship between two tables in which each record from the first table matches zero or more records from the second, but each record from the second table matches exactly one record from the first.
aggregation: Combining several values to produce one.
aggregation function: A function used to produce one value from many, such as maximum or addition.
alias: An alternate name used temporarily for a table or column.
atomic: An operation that cannot be broken into smaller operations.
autoincrement: Automatically add one to a value.
base case: A starting point for recursion that does not depend on previous recursive calculations.
Binary Large Object (blob): Bytes that are handled as-is rather than being interpreted as numbers, text, or other data types.
common table expression (CTE): A temporary table created at the start of a query, usually to simplify writing the query.
consistent: A state in which all constraints are satisfied, e.g., all columns contain allowed values and all foreign keys refer to primary keys.
correlated subquery: A subquery that depends on a value or values from the enclosing query, and which must therefore be executed once for each of those values.
cross join: A join that creates the cross-product of rows from two tables.
cursor: A reference to the current location in the results of an ongoing query.
data migration: To move data from one form to another, e.g., from one set of tables to a new set or from one DBMS to another.
database: A collection of data that can be searched and retrieved.
database management system (DBMS): A program that manages a particular kind of database.
denormalization: To deliberately introduce duplication or otherwise violate normal forms, typically to improve query performance.
durable: Guaranteed to survive shutdown and restart.
entity-relationship diagram: A graphical depiction of the relationships between tables in a database.
filter: To select records based on whether they pass some Boolean test.
foreign key: A value in one table that identifies a primary key in another table.
full outer join: A join that keeps all rows from both tables, filling missing values with null where rows from one table have no match in the other.
group: A set of records that share a common property, such as having the same value in a particular column.
in-memory database: A database that is stored in memory rather than on disk.
index: An auxiliary data structure that enables faster access to records.
infinite recursion: See "infinite recursion".
isolated: The appearance of having executed in an otherwise-idle system.
join: To combine records from two tables.
join condition: The criteria used to decide which rows from each table in a join are combined.
join table: A table that exists solely to enable information from two tables to be connected.
left outer join: A join that is guaranteed to keep all rows from the first (left) table. Columns from the right table are filled with actual values if available or with null otherwise.
many-to-many relation: A relationship between two tables in which each record from the first table may match zero or more records from the second and vice versa.
materialized view: A view that is stored on disk and updated on demand.
normal form: One of several (loosely defined) rules for organizing data in tables.
NoSQL database: Any database that doesn't use the relational model.
null: A special value representing "not known".
object-relational mapper (ORM): A library that translates objects in a program into database queries and the results of those queries back into objects.
path expression: An expression identifying an element or a set of elements in a JSON structure.
primary key: A value or values in a database table that uniquely identifies each record in that table.
query: A command to perform some operation in a database (typically data retrieval).
recursive case: The second or subsequent step in self-referential accumulation of data.
recursive CTE: A common table expression that refers to itself. Every recursive CTE must have a base case and a recursive case.
relational database management system (RDBMS): A database management system that stores data in tables with columns and rows.
subquery: A query used within another query.
table-valued function: A function that returns rows rather than a single value.
temporary table: A table that is explicitly constructed in memory outside any particular query.
ternary logic: A logic based on three values: true, false, and "don't know" (represented as null).
tombstone: A marker value added to a record to show that it is no longer active. Tombstones are used as an alternative to deleting data.
trigger: An action that runs automatically when something happens in a database, typically insertion or deletion.
Uniform Resource Identifier (URI): A string that identifies a resource (such as a web page or database) and the protocol used to access it.
upsert: To update a record if it exists or insert (create) a new record if it doesn't.
view: A rearrangement of data in a database that is regenerated on demand.
window function: A function that combines data from adjacent rows in a database query's result.
Acknowledgments
This tutorial would not have been possible without:
* Andi Albrecht's sqlparse module
* Dimitri Fontaine's The Art of PostgreSQL
* David Rozenshtein's The Essence of SQL (now sadly out of print)
I would also like to thank the following for spotting issues, making suggestions, or submitting changes:
* Sam Hames
* Robert Kern
* Roy Pardee
* Manos Pitsidianakis
* Daniel Possenriede
* Adam Rosien
* Thomas Sandmann
2024 Greg Wilson