https://death.andgravity.com/query-builder-how

Write an SQL query builder in 150 lines of Python!

2021-08-20 · 20 minute read

This is the fourth article in a series about writing an SQL query
builder for my feed reader library.

Today, we'll dive into the code by rewriting it from scratch.

Think of it as part walk-through, part tutorial; along the way, we'll
explore:

  * API design
  * knowing when to be lazy
  * worse and better ways of doing things

As you read this, keep in mind it is a story, thus linear by
necessity. Development was decidedly not so: I tried things out, I
changed my mind multiple times, and I rewrote everything once. Even
now, there are other equally-good or better implementations; this one
is simply good enough.

Contents

  * What are we trying to build?
      + Trade-offs
  * A minimal plausible solution
      + Data representation
      + Classes
      + Adding things
      + Output
      + Tests
  * Separators
  * Aliases
  * Subqueries
  * Joins
  * Distinct
  * More tests
  * More init
  * Bonus: things that didn't make the cut
      + Insert / update / delete
      + Arbitrary strings as subqueries
      + Query objects as subqueries
      + Union / intersect / except

What are we trying to build?

We want a way of building SQL strings that takes care of formatting:

>>> query = Query()
>>> query.SELECT('one').FROM('table')
<builder.Query object at 0x7fc953e60640>
>>> print(query)
SELECT
    one
FROM
    table

... and allows us to add parts incrementally:

>>> query.SELECT('two').WHERE('condition')
<builder.Query object at 0x7fc953e60640>
>>> print(query)
SELECT
    one,
    two
FROM
    table
WHERE
    condition

While not required, I recommend reading the previous articles to get
a better idea of the problem we're trying to solve, and the context
we're solving it in.

In short, whatever we build should:

  * support SELECT with conditional WITH, WHERE, ORDER BY, JOIN etc.
  * expose the names of the result columns (for scrolling window
    queries)
  * be easy to use, understand and maintain

This query builder is not directly comparable with that of an ORM.
Instead, it is an alternative to building plain SQL strings by hand.

The caveats that apply to plain SQL apply to it as well: using
user-supplied values directly in an SQL query exposes you to SQL
injection attacks. Instead, use parametrized queries whenever
possible, and resort to escaping only when there is no alternative.
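
A minimal sketch of that caveat, using the standard library sqlite3
module. The entries table and its values are made up for
illustration. The query text only ever contains a named placeholder;
the user-supplied value travels separately:

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE entries (title TEXT)")
db.executemany("INSERT INTO entries VALUES (?)", [("one",), ("two",)])

# hostile input stays plain data, because it is never spliced into the SQL
user_input = "one' OR '1'='1"

query = "SELECT title FROM entries WHERE title = :title"
rows = db.execute(query, {'title': user_input}).fetchall()
safe_rows = db.execute(query, {'title': 'one'}).fetchall()
```

The hostile string matches nothing, since it is compared as a value
instead of being parsed as SQL.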

Trade-offs

Our solution does not exist in a void; it exists to be used by my
feed reader library.

Notably, we're not making a general-purpose library with external
users whose needs we're trying to anticipate; there's exactly one
user with a pretty well-defined use case, and strict backwards
compatibility is not necessary.

This allows us to make some upfront decisions to help with
maintainability:

  * No needless customization. We can change the code directly if we
    need to.
  * No other features except the known requirements. We can add new
    ones when we need them.
  * No effort to support other syntax than SQLite.
  * No extensive testing. We can rely on the existing comprehensive
    functional tests.
  * No SQL validation. The database does this already.
      + However, it would be nice to get at least a little error
        checking. No need for custom exceptions, any kind is
        acceptable - they should come up only during development and
        testing anyway.

A minimal plausible solution

Data representation

As mentioned before, my prototype was based on the idea that queries
can be represented as plain data structures.

Looking at a nicely formatted query, a natural representation may
reveal itself:

SELECT
    one,
    two
FROM
    table
WHERE
    condition AND
    another-condition

See it?

It's a mapping with a list of strings for each clause:

{
    'SELECT': [
        'one',
        'two',
    ],
    'FROM': [
        'table',
    ],
    'WHERE': [
        'condition',
        'another-condition',
    ],
}

Let's use this as our starting model, and make ourselves a query
builder.

Classes

We start with a class:

class Query:

    # clauses, in the order they appear in the output
    keywords = [
        'WITH',
        'SELECT',
        'FROM',
        'WHERE',
        'GROUP BY',
        'HAVING',
        'ORDER BY',
        'LIMIT',
    ]

    def __init__(self):
        self.data = {k: [] for k in self.keywords}

We use a class because most of the time we don't want to interact
with the underlying data structure, since it's more likely to change.
We're not subclassing dict, since that would unintentionally expose
its methods (and thus, behavior), and we may need those names for
something else.

Also, a class allows us to reduce verbosity:

# we want
query.SELECT('one', 'two').FROM('table')
# not
query['SELECT'].extend(['one', 'two'])
query['FROM'].append('table')

We use class variables for "static" data instead of hardcoding or
module variables so it's easy to override (more on that later).

We don't customize anything in __init__() for now; if we need more
clauses, we can add them to keywords directly. Adding all known
keywords to data upfront gets us free error checking: data[keyword]
raises KeyError for unknown keywords.
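
The free error checking can be seen with a stripped-down stand-in for
the class (only two keywords, for brevity; this is a sketch, not the
final code):

```python
class Query:
    keywords = ['SELECT', 'FROM']

    def __init__(self):
        self.data = {k: [] for k in self.keywords}


query = Query()
query.data['SELECT'].append('one')

error = None
try:
    # misspelled keyword: there is no such key, so nothing gets stored
    query.data['ESLECT'].append('two')
except KeyError as e:
    error = e
```

A plain dict that started out empty would have needed an explicit
check to catch the typo.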

Unless specified otherwise, I'll use clause and keyword to mean "item
in self.data", not "SQL clause or keyword in general".

We could use dataclasses, but of the generated magic methods, we'd
only use __repr__(), and its output would be too long to be useful
anyway.

Adding things

Next, we add code for adding string fragments to each clause:

# at the top of the module
import functools
import textwrap

    # in class Query
    def add(self, keyword, *args):
        target = self.data[keyword]

        for arg in args:
            target.append(_clean_up(arg))

        return self

    def __getattr__(self, name):
        if not name.isupper():
            return getattr(super(), name)
        return functools.partial(self.add, name.replace('_', ' '))


def _clean_up(thing: str) -> str:
    return textwrap.dedent(thing.rstrip()).strip()

add() is roughly equivalent to data[keyword].extend(args).

The main difference is that we dedent the arguments and remove
trailing whitespace. This is intentional: we clean everything up and
make as many choices as possible when adding things, so we don't have
to care about that when generating output, and so error checking
happens as early as possible.

Also, add() returns self to enable method chaining:
query.add(...).add(...).

---------------------------------------------------------------------

__getattr__() is called when an attribute does not exist, and allows
us to return something instead of getting the default AttributeError.

What we return is a KEYWORD(*args) callable made on the fly by
wrapping add() in a partial; a closure capturing name would be
functionally equivalent.
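
To make the equivalence concrete, here is a small sketch with a
module-level add() standing in for the method; make_partial() and
make_closure() are names invented for this example:

```python
import functools


def add(keyword, *args):
    # stand-in for Query.add(); just echoes what it received
    return keyword, args


def make_partial(name):
    return functools.partial(add, name.replace('_', ' '))


def make_closure(name):
    keyword = name.replace('_', ' ')

    def method(*args):
        # keyword is captured from the enclosing scope
        return add(keyword, *args)

    return method


from_partial = make_partial('ORDER_BY')('first', 'second')
from_closure = make_closure('ORDER_BY')('first', 'second')
```

Both callables forward to add() with the keyword already filled in;
the partial is simply shorter to write.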

Requiring the keywords to be uppercase is a stylistic choice, but
does have advantages: it signals to the reader these are special
"methods", and avoids shadowing dunder methods like __deepcopy__()
without extra checks.

To indicate the attribute really doesn't exist, we need to raise
AttributeError; we let getattr() do it for us (the parent object
doesn't have a custom __getattr__()).

We could store the partial on the instance, which would side-step
__getattr__() on subsequent calls, so we only make one partial per
keyword; we could do it in __init__(), and not use __getattr__() at
all; we could even use partialmethod, so there's only one per keyword
per class! Or we can do nothing - they're likely premature
optimization, and what we're doing now is more flexible anyway.
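
For the curious, a rough sketch of the partialmethod variant, on a
cut-down class with two keywords. This is one possible way to wire it
up, not what the builder actually does:

```python
import functools


class Query:
    keywords = ['SELECT', 'FROM']

    def __init__(self):
        self.data = {k: [] for k in self.keywords}

    def add(self, keyword, *args):
        self.data[keyword].extend(args)
        return self


# one partialmethod per keyword per class, created once at import time
for _keyword in Query.keywords:
    setattr(
        Query,
        _keyword.replace(' ', '_'),
        functools.partialmethod(Query.add, _keyword),
    )

query = Query().SELECT('one').FROM('table')
```

A side effect of this variant is that a misspelled keyword fails on
attribute access, since there is no __getattr__() to catch it.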

---------------------------------------------------------------------

I said error checking happens as early as possible; that's almost
true: if you look carefully at the code, you may notice query.ESLECT
doesn't raise an exception until called - query.ESLECT().

Doing most of the work in add() does have some benefits, though: we
can use it with partial and get chaining for free, and it's an escape
hatch for when we want to use a "keyword" that's not a Python
identifier (this will be useful later).
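
Both points can be checked on a cut-down version of the class. The
'(' keyword is an assumption made here purely to have a
non-identifier to play with:

```python
import functools


class Query:
    keywords = ['SELECT', 'FROM', '(']

    def __init__(self):
        self.data = {k: [] for k in self.keywords}

    def add(self, keyword, *args):
        self.data[keyword].extend(args)
        return self

    def __getattr__(self, name):
        if not name.isupper():
            return getattr(super(), name)
        return functools.partial(self.add, name.replace('_', ' '))


query = Query()

# attribute access succeeds; the partial is built without touching data
misspelled = query.ESLECT

try:
    misspelled('one')
except KeyError:
    raised_on_call = True
else:
    raised_on_call = False

# '(' is not a valid Python identifier, but add() accepts it directly
query.add('(', 'one', 'two')
```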

Output

Finally, we turn the query into SQL:

    default_separator = ','

    def __str__(self):
        return ''.join(self._lines())

    def _lines(self):
        for keyword, things in self.data.items():
            if not things:
                continue

            yield f'{keyword}\n'
            yield from self._lines_keyword(keyword, things)

    def _lines_keyword(self, keyword, things):
        for i, thing in enumerate(things, 1):
            last = i == len(things)

            yield self._indent(thing)

            if not last:
                yield self.default_separator

            yield '\n'

    _indent = functools.partial(textwrap.indent, prefix='    ')

The only output API is str(); being the standard way of turning
objects into strings in Python, it requires zero effort to learn.

str(query) calls __str__, which delegates to _lines(). We use a
generator mainly because it allows us to write yield line instead of
rv.append(line), making for somewhat cleaner code.

Another benefit of a generator is that it's lazy, so we can pass it
around without having to build intermediary lists in memory; for
example, to a file's writelines() method, or in yield from in another
generator (e.g. for nested subqueries). We don't need it here, but
it's useful when generating a lot of values.
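
As an illustration of the laziness argument, here is a simplified,
stand-alone lines() generator (a sketch, not the method above) fed
straight into writelines():

```python
import io


def lines(data):
    """Lazily yield output lines for a {keyword: [strings]} mapping."""
    for keyword, things in data.items():
        if not things:
            continue
        yield f'{keyword}\n'
        for i, thing in enumerate(things, 1):
            separator = '' if i == len(things) else ','
            yield f'    {thing}{separator}\n'


data = {'SELECT': ['one', 'two'], 'FROM': ['table'], 'WHERE': []}

# writelines() accepts any iterable of strings, so no list is built
buffer = io.StringIO()
buffer.writelines(lines(data))
output = buffer.getvalue()
```

The same generator could be consumed by ''.join() or by yield from in
another generator without any changes.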

We split the logic for individual clauses into _lines_keyword(),
because we'll keep adding stuff to it. (I initially left everything
in _lines(), and refactored when things got too complicated; no need
to do that now.)

Since we'll want to indent things in the same way in more than one
place, we make it a static "method" using partial.

You may notice we're not sorting the clauses in any way; dicts
guarantee insertion order in Python 3.6+^1, and we built data from
keywords, so the order is preserved.

Tests

Let's add a simple test to make sure we don't break already working
stuff:

from textwrap import dedent

from builder import Query


def test_query_simple():
    query = Query().SELECT('select').FROM('from-one', 'from-two')
    assert str(query) == dedent(
        """\
        SELECT
            select
        FROM
            from-one,
            from-two
        """
    )

We'll keep adding to it with each feature.

---------------------------------------------------------------------

For a minimal solution, we are done. We've "spent" 62 lines, or 38
statements.

The code so far: builder.py, test_builder.py.

Separators

At this point, WHERE doesn't really make sense:

>>> print(Query().WHERE('a', 'b'))
WHERE
    a,
    b

We fix it by special-casing separators for a few clauses:

    separators = dict(WHERE='AND', HAVING='AND')

    def _lines_keyword(self, keyword, things):
        for i, thing in enumerate(things, 1):
            last = i == len(things)

            yield self._indent(thing)

            if not last:
                try:
                    yield ' ' + self.separators[keyword]
                except KeyError:
                    yield self.default_separator

            yield '\n'

We could've used defaultdict instead of using default_separator, but
then we'd have to remember non-comma separators need a space: ' AND';
putting it in code means we don't have to remember anything.
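
For comparison, a sketch of what the defaultdict alternative might
look like; join_things() is a made-up helper that stands in for the
output code:

```python
from collections import defaultdict

# every separator, including the default, lives in one mapping;
# non-comma separators have to carry their own leading space
separators = defaultdict(lambda: ',', WHERE=' AND', HAVING=' AND')


def join_things(keyword, things):
    return (separators[keyword] + '\n').join(things)


where = join_things('WHERE', ['a', 'b'])
select = join_things('SELECT', ['a', 'b'])
```

Forgetting the space in ' AND' would silently produce 'aAND', which
is the kind of thing the try/except version guards against.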

Also, we could've put the separator on a new line: 'one\nAND two'
vs. 'one AND\ntwo'. While slightly better style, it makes code
more complicated for little benefit, and makes it less obvious that
AND is just another separator.

We add WHERE to the test.

def test_query_simple():
    query = (
        Query()
        .SELECT('select')
        .FROM('from-one', 'from-two')
        .WHERE('where-one', 'where-two')
    )
    assert str(query) == dedent(
        """\
        SELECT
            select
        FROM
            from-one,
            from-two
        WHERE
            where-one AND
            where-two
        """
    )

The code so far: builder.py, test_builder.py.

Aliases

One of the requirements is that it should be possible to implement
scrolling window queries on top. For this, code needs to get the
result column names - the SELECT expressions or their aliases - and
add them to a generated WHERE condition.

Parsing the result column is straightforward only for simple cases:

>>> query = Query().SELECT(
...   'column',
...   'column AS alias',
...   'column as alias',
...   '(SELECT column FROM table AS another-table)',
... )
>>> [s.rpartition(' AS ')[2] for s in query.data['SELECT']]
['column', 'alias', 'column as alias', 'another-table)']

An acceptable compromise is using pairs of strings for aliased
columns. Since the column expression might be quite long, we'll make
the alias the first thing in the pair.

>>> print(Query().SELECT(('alias', 'one'), 'two'))
SELECT
    one AS alias,
    two

As mentioned earlier, we store everything in a standard way to keep
output code simpler. A plain 2-tuple is a decent choice, but a named
tuple is more readable.

from typing import NamedTuple


class _Thing(NamedTuple):
    value: str
    alias: str = ''

    @classmethod
    def from_arg(cls, arg):
        # a plain string has no alias; a pair is (alias, value)
        if isinstance(arg, str):
            alias, value = '', arg
        elif len(arg) == 2:
            alias, value = arg
        else:
            raise ValueError(f"invalid arg: {arg!r}")
        return cls(_clean_up(value), _clean_up(alias))

Conveniently, this gives us a place to convert the string-or-pair:
the from_arg() alternate constructor. We could've made it a
stand-alone function, but this way it's easier to see what type is
being returned.

Note that we use an empty string to mean "no alias". In general, it's
a good idea to distinguish this kind of absence by using None, since
the empty string may be a valid input, and None can prevent some bugs
- e.g. you can't concatenate None to a string. Here, an empty string
cannot be a valid alias, and we use format strings, so we don't
bother.

Using it is just a one-line change to add():

    def add(self, keyword, *args):
        target = self.data[keyword]

        for arg in args:
            target.append(_Thing.from_arg(arg))

        return self
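
With aliases stored separately, getting the result column names (the
requirement this section started from) no longer needs any parsing. A
minimal sketch; result_column_names() is a hypothetical helper, not
part of the builder:

```python
from typing import NamedTuple


class _Thing(NamedTuple):
    value: str
    alias: str = ''


def result_column_names(things):
    """Hypothetical helper: the alias if there is one, else the expression."""
    return [thing.alias or thing.value for thing in things]


select = [
    _Thing('column'),
    _Thing('(SELECT column FROM table AS another-table)', 'alias'),
]
names = result_column_names(select)
```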

On output, we have two concerns:

 1. there may or may not be an alias
 2. the order differs depending on the keyword: you have SELECT expr
    AS column-alias, but WITH table-name AS (stmt) (we treat the CTE
    table name as an alias)

We can model this with mostly-empty defaultdicts with per-clause
format strings:

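# at the top of the module
from collections import defaultdict

    # in class Query: (formats without alias, formats with alias)
    formats = (
        defaultdict(lambda: '{value}'),
        defaultdict(lambda: '{value} AS {alias}', WITH='{alias} AS {value}'),
    )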

... and choose the right defaultdict using the alias's boolean
value^2:

    def _lines_keyword(self, keyword, things):
        for i, thing in enumerate(things, 1):
            last = i == len(things)

            format = self.formats[bool(thing.alias)][keyword]
            yield self._indent(format.format(value=thing.value, alias=thing.alias))

            if not last:
                try:
                    yield ' ' + self.separators[keyword]
                except KeyError:
                    yield self.default_separator

            yield '\n'
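
The format lookup can be tried in isolation; render() below is a
stand-alone stand-in for the first two lines of the loop body:

```python
from collections import defaultdict

formats = (
    defaultdict(lambda: '{value}'),
    defaultdict(lambda: '{value} AS {alias}', WITH='{alias} AS {value}'),
)


def render(keyword, value, alias=''):
    # bool(alias) is False (index 0) or True (index 1)
    return formats[bool(alias)][keyword].format(value=value, alias=alias)


no_alias = render('SELECT', 'one')
select_alias = render('SELECT', 'one', 'alias')
with_alias = render('WITH', 'SELECT 1', 'table-name')
```

Passing alias to format() even when the chosen format string doesn't
use it is fine; str.format() ignores unused keyword arguments.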

We add an aliased expression to the test.

def test_query_simple():
    query = (
        Query()
        .SELECT('select-one', ('alias', 'select-two'))
        .FROM('from-one', 'from-two')
        .WHERE('where-one', 'where-two')
    )
    assert str(query) == dedent(
        """\
        SELECT
            select-one,
            select-two AS alias
        FROM
            from-one,
            from-two
        WHERE
            where-one AND
            where-two
        """
    )

The code so far: builder.py, test_builder.py.

Subqueries

Currently, WITH is still a little broken:

>>> print(Query().WITH(('table-name', 'SELECT 1')))
WITH
    table-name AS SELECT 1

Since common table expressions always have the SELECT statement
parenthesized, we'd like to have it out of the box, with proper
indentation:

WITH
    table-name AS (
        SELECT 1
    )

A simple way of handling this is to change the WITH format string to
'{alias} AS (\n{indented}\n)', where indented is the value, but
indented.^3
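
A sketch of that simple approach, outside of the class; render_with()
is a name made up for this example:

```python
import textwrap

with_format = '{alias} AS (\n{indented}\n)'


def render_with(alias, value):
    # indent the statement one level deeper than the parentheses
    indented = textwrap.indent(value, '    ')
    return with_format.format(alias=alias, indented=indented)


rendered = render_with('table-name', 'SELECT 1')
```

The output code would then indent the whole thing once more, like any
other value.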

This kinda works, but is limited in usefulness; for instance, we
can't easily build something like this on top:

Query().FROM(('alias', 'SELECT 1'), is_subquery=True)

Instead, let's keep refining our model, and use a flag to mark
subqueries:

class _Thing(NamedTuple):
    value: str
    alias: str = ''
    is_subquery: bool = False

    @classmethod
    def from_arg(cls, arg, **kwargs):
        if isinstance(arg, str):
            alias, value = '', arg
        elif len(arg) == 2:
            alias, value = arg
        else:
            raise ValueError(f"invalid arg: {arg!r}")
        return cls(_clean_up(value), _clean_up(alias), **kwargs)

We can then check if a clause always has subqueries, and set the flag
accordingly:

    subquery_keywords = {'WITH'}

    def add(self, keyword, *args):
        target = self.data[keyword]

        kwargs = {}
        if keyword in self.subquery_keywords:
            kwargs.update(is_subquery=True)

        for arg in args:
            target.append(_Thing.from_arg(arg, **kwargs))

        return self

Using it for output is just an extra if:

    def _lines_keyword(self, keyword, things):
        for i, thing in enumerate(things, 1):
            last = i == len(things)

            format = self.formats[bool(thing.alias)][keyword]
            value = thing.value
            if thing.is_subquery:
                value = f'(\n{self._indent(value)}\n)'
            yield self._indent(format.format(value=value, alias=thing.alias))

            if not last:
                try:
                    yield ' ' + self.separators[keyword]
                except KeyError:
                    yield self.default_separator

            yield '\n'

We add WITH to our test.

def test_query_simple():
    query = (
        Query()
        .WITH(('alias', 'with'))
        .SELECT('select-one', ('alias', 'select-two'))
        .FROM('from-one', 'from-two')
        .WHERE('where-one', 'where-two')
    )
    assert str(query) == dedent(
        """\
        WITH
            alias AS (
                with
            )
        SELECT
            select-one,
            select-two AS alias
        FROM
            from-one,
            from-two
        WHERE
            where-one AND
            where-two
        """
    )

The code so far: builder.py, test_builder.py.

Joins

One clause that's entirely missing is JOIN. And it's important:
changing your mind about what you're selecting from happens quite
often.

JOIN is a bit more complicated, mostly because it has different forms
- JOIN, LEFT JOIN and so on; SQLite supports at least 10 variations.

I initially treated any keyword containing JOIN as a separate
keyword, and dealt with it during output. This has a few drawbacks,
though; aside from making the code more complicated, it reorders the
tables: query.JOIN('a').LEFT_JOIN('b').JOIN('c') results in JOIN a
JOIN c LEFT JOIN b.

A better solution is to refine our model even further.

Take a look at these railroad diagrams for the SELECT statement:

[diagrams: select-core (FROM clause), join-clause, join-operator]

You may notice table-or-subquery followed by , in FROM is actually a
subset of table-or-subquery followed by join-operator in join-clause.
That is, for SQLite, a comma is just another join operator.

Put the other way around, a join operator is just another separator.

Because our separators come after things, not before, we'll model
join operators separately, as fake keywords (that is, not used to
index into data).

First, let's set them:

class _Thing(NamedTuple):
    value: str
    alias: str = ''
    keyword: str = ''
    is_subquery: bool = False

    fake_keywords = dict(JOIN='FROM')

    def add(self, keyword, *args):
        keyword, fake_keyword = self._resolve_fakes(keyword)
        target = self.data[keyword]

        kwargs = {}
        if fake_keyword:
            kwargs.update(keyword=fake_keyword)
        if keyword in self.subquery_keywords:
            kwargs.update(is_subquery=True)

        for arg in args:
            target.append(_Thing.from_arg(arg, **kwargs))

        return self

    def _resolve_fakes(self, keyword):
        for part, real in self.fake_keywords.items():
            if part in keyword:
                return real, keyword
        return keyword, ''

We could've probably just hardcoded this in add() (if 'JOIN' in
keyword: ...), but doing it like this makes it easier to see at a
glance that "JOIN is a fake FROM".
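
As a stand-alone sketch (a function instead of a method, otherwise
the same logic), the resolution step behaves like this:

```python
fake_keywords = dict(JOIN='FROM')


def resolve_fakes(keyword):
    """Return (real keyword, fake keyword or '')."""
    for part, real in fake_keywords.items():
        if part in keyword:
            return real, keyword
    return keyword, ''


left_join = resolve_fakes('LEFT JOIN')
plain_join = resolve_fakes('JOIN')
where = resolve_fakes('WHERE')
```

Because the check is a substring match, every JOIN variation ends up
in the FROM clause without being listed one by one.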

Using keyword as a separator is relatively straightforward:

    def _lines(self):
        for keyword, things in self.data.items():
            if not things:
                continue

            yield f'{keyword}\n'

            # plain things first, then the ones with a join operator
            grouped = [], []
            for thing in things:
                grouped[bool(thing.keyword)].append(thing)
            for group in grouped:
                yield from self._lines_keyword(keyword, group)

    def _lines_keyword(self, keyword, things):
        for i, thing in enumerate(things, 1):
            last = i == len(things)

            if thing.keyword:
                yield thing.keyword + '\n'

            format = self.formats[bool(thing.alias)][keyword]
            value = thing.value
            if thing.is_subquery:
                value = f'(\n{self._indent(value)}\n)'
            yield self._indent(format.format(value=value, alias=thing.alias))

            if not last and not thing.keyword:
                try:
                    yield ' ' + self.separators[keyword]
                except KeyError:
                    yield self.default_separator

            yield '\n'

Since FROM always comes before JOIN, we make sure to output the real
ones first.
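
The grouping step on its own, with a cut-down _Thing, shows that the
relative order of the joins survives (unlike in my first attempt):

```python
from typing import NamedTuple


class _Thing(NamedTuple):
    value: str
    keyword: str = ''


things = [
    _Thing('a', 'JOIN'),
    _Thing('from-one'),
    _Thing('b', 'LEFT JOIN'),
    _Thing('from-two'),
    _Thing('c', 'JOIN'),
]

# index 0 collects plain FROM things, index 1 the ones with a join operator
grouped = [], []
for thing in things:
    grouped[bool(thing.keyword)].append(thing)

ordered = [thing.value for group in grouped for thing in group]
```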

We add a JOIN to the test.

 6 def test_query_simple():
 7     query = (
 8         Query()
 9         .WITH(('alias', 'with'))
10         .SELECT('select-one', ('alias', 'select-two'))
11         .FROM('from-one', 'from-two')
12         .JOIN('join')
13         .WHERE('where-one', 'where-two')
14     )
15     assert str(query) == dedent(
16         """\
17         WITH
18             alias AS (
19                 with
20             )
21         SELECT
22             select-one,
23             select-two AS alias
24         FROM
25             from-one,
26             from-two
27         JOIN
28             join
29         WHERE
30             where-one AND
31             where-two
32         """
33     )

The code so far: builder.py, test_builder.py.

Distinct

The final model change is to support SELECT DISTINCT.

DISTINCT and ALL are flags that apply to the whole clause; we'll
model them as such:

class _FlagList(list):
    flag: str = ''

    def __init__(self):
        self.data = {k: _FlagList() for k in self.keywords}

Since most of the time we're OK with the default flag, we don't
bother setting it in __init__, and use a class variable instead. If
we need to customize it, we can set flag on the instance, shadowing
the class variable.

A __repr__ showing the flag would be nice, but it'd only be useful
during debugging, so we skip it as well.

We set the flag based on a known set for each clause; like with fake
keywords, we pull the "parsing" logic into a separate method:

    flag_keywords = dict(SELECT={'DISTINCT', 'ALL'})

    def add(self, keyword, *args):
        keyword, fake_keyword = self._resolve_fakes(keyword)
        keyword, flag = self._resolve_flags(keyword)
        target = self.data[keyword]

        if flag:
            if target.flag:
                raise ValueError(f"{keyword} already has flag: {flag!r}")
            target.flag = flag

        kwargs = {}
        if fake_keyword:
            kwargs.update(keyword=fake_keyword)
        if keyword in self.subquery_keywords:
            kwargs.update(is_subquery=True)

        for arg in args:
            target.append(_Thing.from_arg(arg, **kwargs))

        return self

    def _resolve_flags(self, keyword):
        prefix, _, flag = keyword.partition(' ')
        if prefix in self.flag_keywords:
            if flag and flag not in self.flag_keywords[prefix]:
                raise ValueError(f"invalid flag for {prefix}: {flag!r}")
            return prefix, flag
        return keyword, ''
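
Pulled out into a plain function (a sketch for experimenting, same
logic as the method), the flag "parsing" can be poked at directly:

```python
flag_keywords = dict(SELECT={'DISTINCT', 'ALL'})


def resolve_flags(keyword):
    """Return (keyword without flag, flag or '')."""
    prefix, _, flag = keyword.partition(' ')
    if prefix in flag_keywords:
        if flag and flag not in flag_keywords[prefix]:
            raise ValueError(f"invalid flag for {prefix}: {flag!r}")
        return prefix, flag
    return keyword, ''


with_flag = resolve_flags('SELECT DISTINCT')
without_flag = resolve_flags('SELECT')
two_words = resolve_flags('GROUP BY')

message = None
try:
    resolve_flags('SELECT DISTICT')
except ValueError as e:
    message = str(e)
```

Multi-word keywords like GROUP BY pass through untouched, because
only their first word is looked up in flag_keywords.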

Using it for output is again straightforward:

    def _lines(self):
        for keyword, things in self.data.items():
            if not things:
                continue

            if things.flag:
                yield f'{keyword} {things.flag}\n'
            else:
                yield f'{keyword}\n'

            grouped = [], []
            for thing in things:
                grouped[bool(thing.keyword)].append(thing)
            for group in grouped:
                yield from self._lines_keyword(keyword, group)

We add a SELECT DISTINCT to our test.

def test_query_simple():
    query = (
        Query()
        .WITH(('alias', 'with'))
        .SELECT('select-one', ('alias', 'select-two'))
        .FROM('from-one', 'from-two')
        .JOIN('join')
        .WHERE('where-one', 'where-two')
        .SELECT_DISTINCT('select-three')
    )
    assert str(query) == dedent(
        """\
        WITH
            alias AS (
                with
            )
        SELECT DISTINCT
            select-one,
            select-two AS alias,
            select-three
        FROM
            from-one,
            from-two
        JOIN
            join
        WHERE
            where-one AND
            where-two
        """
    )

The code so far: builder.py, test_builder.py.

More tests

Our only test isn't all that simple anymore; maybe it's time to split
it in two: one with a really simple query, and one with a really
complicated query.

... something like this.

def test_query_simple():
    query = Query().SELECT('select').FROM('from').JOIN('join').WHERE('where')
    assert str(query) == dedent(
        """\
        SELECT
            select
        FROM
            from
        JOIN
            join
        WHERE
            where
        """
    )


def test_query_complicated():
    """Test a complicated query:

    * order between different keywords does not matter
    * arguments of repeated calls get appended, with the order preserved
    * SELECT can receive 2-tuples
    * WHERE and HAVING arguments are separated by AND
    * JOIN arguments are separated by the keyword, and come after plain FROM
    * no-argument keywords have no effect, unless they are flags

    """
    query = (
        Query()
        .WHERE()
        .OUTER_JOIN('outer join')
        .JOIN('join')
        .LIMIT('limit')
        .JOIN()
        .ORDER_BY('first', 'second')
        .SELECT('one')
        .HAVING('having')
        .SELECT(('two', 'expr'))
        .GROUP_BY('group by')
        .FROM('from')
        .SELECT('three', 'four')
        .FROM('another from')
        .WHERE('where')
        .ORDER_BY('third')
        .OUTER_JOIN('another outer join')
        # this isn't technically valid
        .WITH('first cte')
        .GROUP_BY('another group by')
        .HAVING('another having')
        .WITH(('fancy', 'second cte'))
        .JOIN('another join')
        .WHERE('another where')
        .NATURAL_JOIN('natural join')
        .SELECT()
        .SELECT_DISTINCT()
    )
    assert str(query) == dedent(
        """\
        WITH
            (
                first cte
            ),
            fancy AS (
                second cte
            )
        SELECT DISTINCT
            one,
            expr AS two,
            three,
            four
        FROM
            from,
            another from
        OUTER JOIN
            outer join
        JOIN
            join
        OUTER JOIN
            another outer join
        JOIN
            another join
        NATURAL JOIN
            natural join
        WHERE
            where AND
            another where
        GROUP BY
            group by,
            another group by
        HAVING
            having AND
            another having
        ORDER BY
            first,
            second,
            third
        LIMIT
            limit
        """
    )

The code so far: builder.py, test_builder.py.

More init

One last feature: I'd like to reuse the formatting logic for
parenthesized lists.

Good thing __init__ doesn't take any arguments yet:

    def __init__(self, data=None, separators=None):
        self.data = {}
        if data is None:
            data = dict.fromkeys(self.keywords, ())
        for keyword, args in data.items():
            self.data[keyword] = _FlagList()
            self.add(keyword, *args)

        if separators is not None:
            self.separators = separators

Using it looks like:

>>> print(Query({'(': ['one', 'two'], ')': ['']}, separators={'(': 'OR'}))
(
    one OR
    two
)

We could have required data to have the same structure as the
attribute; however, it would be too verbose to use, and I'd have to
do all the clean up myself; that's not very convenient. Instead, we
make it mean "add() these strings for these keywords".^4

We add a separate test for the fancy __init__.

def test_query_init():
    query = Query({'(': ['one', 'two', 'three'], ')': ['']}, {'(': 'OR'})
    assert str(query) == dedent(
        """\
        (
            one OR
            two OR
            three
        )

        """
    )

---------------------------------------------------------------------

OK, now we're really done. We've spent 148 lines, or 101 statements.

The final version of the code: builder.py, test_builder.py. You can
find the type-annotated version used by reader on GitHub.

---------------------------------------------------------------------

That's it for now. :)

This is my first planned series, and still a work in progress.

This means you get a say in it. Email me your questions or comments,
and I'll do my best to address them in one of the future articles.

Bonus: things that didn't make the cut

When talking about trade-offs, I said we'll only add features as
needed; this may seem a bit handwavy - how can I tell adding them
won't make the code explode?

Because I did add them; that's what prototyping was for. But since
they weren't actually used, I removed them - there's no point in them
rotting away.

Here's how you'd go about implementing a few of them.

Insert / update / delete

Make them flag keywords, to support the OR ABORT/FAIL/... variants.

To make VALUES bake in the parentheses, set its format to ({value}).
That's to add one values tuple at a time.

To add one column at a time, we could do this:

  * allow add()ing INSERT with arbitrary flags
  * make INSERT('column', into='table') a synonym of add('INSERT INTO
    table', 'column')
  * classify INSERT and VALUES as parens_keywords - like
    subquery_keywords, but they apply once per keyword, not per value

It'd look like this:

# first insert sets flag
query.INSERT('one', into='table').VALUES(':one')
# later we just add stuff
query.INSERT('two').VALUES(':two')

Arbitrary strings as subqueries

Allow setting add(..., is_subquery=True); you'd then do:

query.FROM('subquery', is_subquery=True)
query.FROM('not subquery')

Query objects as subqueries

Using Query objects as subqueries without having to convert them
explicitly to strings would allow changing them after being add()ed.

To do it, we just need to allow _Thing.value to be a Query, and
override its is_subquery based on an isinstance() check.
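
A toy illustration of the idea, with a single-clause MiniQuery class
invented for this example (it is not the real Query, and the output
is not meant to be valid SQL). The nested object is only converted to
a string on output, so later changes to it show up:

```python
import textwrap


class MiniQuery:
    """Stand-in for Query: one clause, rendered lazily by __str__()."""

    def __init__(self):
        self.things = []

    def FROM(self, *args):
        self.things.extend(args)
        return self

    def __str__(self):
        lines = []
        for thing in self.things:
            # isinstance() decides is_subquery; str() happens only on output
            if isinstance(thing, MiniQuery):
                nested = textwrap.indent(str(thing).rstrip('\n'), '    ')
                lines.append(f'(\n{nested}\n)')
            else:
                lines.append(thing)
        body = textwrap.indent(',\n'.join(lines), '    ')
        return f'FROM\n{body}\n'


inner = MiniQuery().FROM('one')
outer = MiniQuery().FROM(inner)
inner.FROM('two')  # added after inner was passed to outer

output = str(outer)
```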

Union / intersect / except

This one goes a bit meta:

  * add a "virtual" COMPOUND keyword
  * add a new compound(keyword) method, which moves everything except
    WITH and ORDER BY to a subquery, and appends the subquery to
    data['COMPOUND'] with the appropriate fake keyword
  * make __getattr__ return a compound() partial for compound
    keywords
  * special-case COMPOUND in _lines()

---------------------------------------------------------------------

 1. The guaranteed insertion order was actually added to the language
    specification in 3.7, but in 3.6 both CPython and PyPy had it as
    an implementation detail.

 2. This works because False == 0 and True == 1, and is likely too
    clever.

 3. That's what I did initially.

 4. An alternate constructor or a subclass might have been a better
    choice here. We'll fix it if we need __init__ for something else.
    ¯\_(ツ)_/¯

---------------------------------------------------------------------

This is part of a series:

  * SQL query builder in 150 lines of Python
  * Why use an SQL query builder in the first place?
  * Why I wrote my own SQL query builder (in Python)
  * Write an SQL query builder in 150 lines of Python! (this article)

(c) 2021 lemon24