[HN Gopher] The curious case of the aggregation query
___________________________________________________________________
The curious case of the aggregation query
Author : eatonphil
Score : 70 points
Date : 2023-12-05 14:31 UTC (8 hours ago)
(HTM) web link (modern-sql.com)
(TXT) w3m dump (modern-sql.com)
| crazygringo wrote:
| > _Now the mysterious query: SELECT (SELECT sum(a) FROM xx LIMIT
| 1) FROM aa_
|
| To be clear, that's just an absolutely pathological query, since
| column "a" doesn't exist in table "xx" -- it only exists in table
| "aa".
|
| You _can_ use values from outer queries inside of inner queries,
| which are called "correlated subqueries". Which is _not_
| recommended unless you 're already a real SQL expert, because
| they're so confusing (and their performance is also pretty
| absymal).
|
| But this isn't even that -- this is an _aggregate function_
| across an outer variable. I consider myself a strong SQL expert,
| but looking at this query, it just feels nonsensical. Not only
| could I not tell you what it would produce -- I couldn 't even
| guess at what its author intended it to mean in the first place.
|
| > _Later in the article we will see that SQL implementations
| don't even agree about the correct result._
|
| I'm not surprised at all. I'm actually more surprised they
| produce results _at all_ , rather than just errors about it being
| an invalid query.
| MarkusWinand wrote:
| No question, such a query should not be written. That's
| probably the reason why this odd behavior, which is even
| different in various DBMSes, is not causing everyday problems.
| marcosdumay wrote:
| The reason the behavior doesn't cause problems is because
| everybody treats automatic aggregation as a voodoo where they
| know one recipe that works and anything different is domain
| of the Devil.
|
| And, IMO, that's a very sane and reasonable way to treat it.
| The entire idea of automatic aggregation is flawed, and those
| queries should just have a `group by ()` explicit at the
| right place.
| foldU wrote:
| I agree, I think the original sin here is the fact that
| whether a `SELECT` is an aggregation is determined by the
| contents of the scalar expressions at all. I think most of
| this weirdness comes directly out of wanting to be able to
| write both `SELECT sum(x) FROM xx` and `SELECT x FROM xx`
| and have them work.
|
| Not that I have a better solution offhand, in SQL grouping
| by a constant value is not actually the same as not writing
| `GROUP BY` at all since the behaviour on empty tables is
| different.
| marcosdumay wrote:
| Not writing `group by` is the same as writing `group by
| ()`.
|
| And yeah, the difference between that and a value is one
| of those really surprising things on SQL that actually
| make sense and should be this way. Unfortunately, there
| are many of those.
| foldU wrote:
| Ah my apologies, I wasn't familiar with that syntax.
| marcosdumay wrote:
| No need to apologize. My post didn't make that clear at
| all.
| derefr wrote:
| What's an aggregation per se? A SQL query is best thought
| of an arbitrary generator function. You can, using
| functions like UNNEST, end up emitting _multiple_ row-
| tuples for each processed input row-tuple. An aggregation
| is just a generator function that happens to reduce over
| all the input rows and then emit one row-tuple when there
| 's no more input. Query-planning engines do not special-
| case this. It's just a generator node like any other
| generator node.
|
| Consider: using window functions, you can do _partial_
| aggregations over subsets of the input -- without even
| necessarily partitioning the input (i.e. you can compute
| "running totals" and other wacky state-machine-like
| outputs.)
| indigo945 wrote:
| Would you argue that automatic scalar-ism is also flawed,
| and the query SELECT a FROM aa;
|
| should have an explicit grouping, like
| SELECT a FROM aa GROUP BY id;
|
| ?
|
| After all, when you think about it, it's really not the
| aggregate functions that break expectations, it's the
| scalars. Of the four combinations of having or not having
| an explicit `GROUP BY` and having or not having an
| aggregate function, three of them have the unsurprising
| behavior of returning exactly one row per grouping.
| -- aggregate function and group by - one single result per
| grouping SELECT sum(a) FROM aa GROUP BY a; --or
| SELECT sum(a) FROM aa GROUP BY (); -- no
| aggregate function, but a group by - one single result per
| grouping SELECT a FROM aa GROUP BY a;
| -- just aggregate function, but no group by - one single
| result per (single, implicit) grouping SELECT
| sum(a) FROM aa;
|
| But then, when you have neither an aggregate function nor
| an explicit `GROUP BY`, it breaks expectations:
| -- no aggregate function, no group by - one result row per
| row -- in the source set, even though there should
| be only one big implicit group SELECT a FROM aa;
|
| Therefore, I propose that the next SQL standard should
| introduce a new `GROUP BY INDIVIDUAL ROW` keyword, that
| henceforth all "scalar" queries MUST use in order to have
| consistent behaviour with the rest of the language.
| marcosdumay wrote:
| What is flawed is not that there is an implicit grouping
| on all queries. It's that the implicit grouping changes,
| depending on a set of rules that consider stuff written
| in several places that are not explicitly related to it.
|
| You are asking if it makes sense to have an implicit
| grouping at all; it very obviously absolutely does. And
| grouping by individual row is the very obvious default.
| But I do totally support adding that keyword expressing
| the default. All defaults should be expressible.
| indigo945 wrote:
| I don't find it that confusing, to be honest. When the article
| presented the "mysterious" query, I guessed the result
| correctly. A correlated subquery will always execute the entire
| expression for each row of the outer query. Since the value of
| `sum(a)` is known in the context of the outer query, it doesn't
| really matter where you select it from in the inner query.
| Therefore, the following queries are all identical:
| SELECT sum(a) FROM aa; SELECT (SELECT sum(a) FROM xx
| LIMIT 1) FROM aa; SELECT (SELECT sum(a) FROM
| generate_series(1, 1)) FROM aa;
|
| To maybe explain why this behaviour makes sense, stop thinking
| about aggregate queries for a second, and just consider normal
| expressions. What would you expect the result of the following
| query to be? SELECT (SELECT a + x FROM xx
| LIMIT 1) FROM aa;
|
| This is obviously a bad query because it uses `LIMIT` without
| `ORDER BY`, but let's just put that aside. This query should
| make some sense to pretty much any SQL developer: you take each
| `a` value from `aa`, and then add the value from the "first"
| (whatever that means) `x` value from `xx` to it. (In my test on
| Postgres, this returns [11, 12, 13], but there's obviously no
| guarantee because we neglected to sort `xx`.)
|
| Now, since that was easy, consider this simplification of that
| query: SELECT (SELECT a FROM xx LIMIT 1) FROM
| aa;
|
| I think that if the previous query made sense, this one should
| also make sense, as the expression simple has one less term.
| The result is obviously [1, 2, 3].
|
| So, what should the aggregate return: SELECT
| (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
|
| The only real difference is that the outer query is now
| aggregated, and therefore it is implicitly in one big grouping,
| and therefore returns only one row. So the correlated subquery
| gets run once, for that one row - what's the value for `sum(a)`
| in that row, the one row that the outer query returns? Well, 6.
| Duh.
| foldU wrote:
| (I'm the author of the original post [1]) And while I think
| the rules, in the end, make sense, I think it's not quite as
| clear cut as you describe, consider: SELECT
| (SELECT sum(1) FROM xx LIMIT 1) FROM aa;
|
| which returns 3 3 3
|
| Personally, I think having the inner aggregation always
| attach to the nearest `SELECT` would have been an equally
| valid way of defining how this works, but it just so happens
| it is not defined that way.
|
| [1]: https://buttondown.email/jaffray/archive/sql-scoping-is-
| surp...
| indigo945 wrote:
| Well, the "fun" one to consider is IMO not that, but this:
| SELECT (SELECT sum(a + x) FROM xx LIMIT 1) FROM aa;
|
| Making a pretty diagram where all the behaviours from
| various DBMSs are listed is left as an exercise to the
| author. :)
| foldU wrote:
| I think this one is obvious actually! There's no choice
| but to aggregate it at the level of the `xx` `SELECT`, no
| other level has access to the `x` column.
| MarkusWinand wrote:
| > Making a pretty diagram where all the behaviours from
| various DBMSs are listed is left as an exercise to the
| author. :)
|
| It's there already!
|
| It's in the chart as footnote "b": Outer reference must
| be the only argument (doesn't support F441)
|
| That's the one thing SQL Server doesn't eat. Those that
| are green in the chart work fine in this case.
| tsimionescu wrote:
| You say it's obvious, but it apparently wasn't obvious to the
| developers of Oracle DB, one of the most used databases.
|
| I think the most unexpected thing is that the query engine
| looks into the inner queries to decide whether to return all
| rows of the outer table or some aggregation of them. If there
| had been a group by clause, it would have been much clearer
| that an aggregation must be happening.
|
| Edit: also, the "obvious" logic actually changes completely
| if you look at the related query, SELECT (SELECT sum(a+x)
| FROM xx LIMIT 1) FROM aa;
|
| This returns 63, 66, 69. Similarly, sum(1) will return 3, 3,
| 3.
|
| So, we could very well expect that sum (a) would work like
| that - it would treat each value of `a` as a constant that it
| aggregates over all rows of xx, so it would return 3, 6, 9
| (the limit is anyway ignored if you are aggregating).
| jhoechtl wrote:
| > Oracle DB, one of the most used databases.
|
| Is it still? Maybe 20 years ago but it's been a long time
| that I met an It guy who talked about himself having worked
| with an Oracle DB recently.
|
| It's all PostgreSQL, MS SQL and MySQL/MariaDB where I work.
| da_chicken wrote:
| > I don't find it that confusing, to be honest.
|
| I would say it's not all that complicated what is going on. I
| would say that you can figure it out if you have experience
| reading queries.
|
| I would _never_ say it 's not confusing. Just because the
| query engine can parse and generate an execution plan and you
| can understand how and why it's doing what it decides to do
| doesn't mean it's not confusing. "Confusing" doesn't mean
| "incomprehensible." It is confusing. It's confusing in almost
| the same way that "6/2(1+2) = ?" viral equation is confusing
| [0] or the Fast Inverse Square Root is confusing. This query
| is _unconventional_.
|
| Not "unconventional" in the sense that it's uncommon
| (although it is also that). It's unconventional in the sense
| that _it goes against established conventions_. Humans don 't
| write queries that way. This query violates the collective
| query conventions of the SQL community. People don't write
| queries like this or teach writing queries like this because
| it's not a particularly useful or portable pattern to put the
| aggregate value of the outside query in the SELECT clause of
| an inside correlated subquery. There are better ways to write
| this query. Because of that, it's not clear what the author's
| _intent_ was or why they would choose to write it thus
| instead of, well, nearly any other way that a human would
| actually choose beforehand. It 's so unconventional that the
| intent is uncertain, and "uncertainty of intent" is another
| way to say "confusing."
|
| Code is not valuable because it executes. Code is valuable
| because the programmer, the computer, and any future
| programmer agree on the intent with no uncertainty. That's
| why over the long term clarity is often more potent than
| cleverness.
|
| [0]: https://youtu.be/4x-BcYCiKCk
| nimish wrote:
| Correlated subqueries are absolutely fine and are equivalent to
| (possibly lateral) joins. They aren't as easy to identify and
| optimize as such, so I wouldn't necessarily recommend them, but
| sometimes they make sense. Not an issue with a good query
| planner and if they clarify a query.
| da_chicken wrote:
| Yeah, correlated subqueries are fine. Unless you're on MySQL,
| which (historically, at least) was infamously terrible at
| executing them.
|
| The reason this query works is because correlated
| subqueries... need something to correlate with! Usually the
| column from the outside query is only referenced in the WHERE
| clause, but there isn't really a reason you _can 't_
| reference it in the inside query's SELECT clause. After all,
| what if you needed it for a CASE expression there?
|
| There are certainly _better_ ways to write what this query is
| accomplishing -- at the very least there are ways to write it
| that make it much more clear what your intentions are -- but
| just because it 's weird doesn't mean it's invalid. The
| developers know they cannot anticipate all possible
| operations that might be desired, and there are already
| countless ways to write queries _incorrectly_.
| HDThoreaun wrote:
| As an employee at a database company working on the plan
| optimizer I would like to ask you not to use correlated
| subqueries for my own sanity. Close to 100% of the mind
| blowing bugs are correlated subqueries. Kidding of course but
| holy shit my life would be so much easier without them.
| crdrost wrote:
| For a while now, whenever I have taught SQL to people (which was
| only I think 3 times over like 8 years) I ask them to imagine
| that the word SELECT can be replaced with AGGREGATE which unlocks
| these magical words GROUP BY and SUM and so forth, and they
| understand me perfectly well: and then I drop on them that I was
| lying and the magic word is still SELECT but you have to know
| which kind of SELECT it is from context clues and they are like
| "What???? Whyyyyyy."
| derefr wrote:
| To be honest, teaching SQL "syntax in" seems like a really
| backwards way of getting an understanding of what SQL is doing.
|
| I run the tech side of a company almost entirely built on
| complex SQL queries. If I were to take a fresh college student
| and "teach them SQL"... well, I _wouldn 't_ teach them SQL. I'd
| teach them _relational algebra_ : projection, selection, join,
| etc. as mathematical operators over bags of tuples. I'd make
| sure that they could show me what the bag-of-tuples looks like
| at each step, as each operator is applied. I'd get them to
| write code in some programming language with first-class Set
| and Tuple datatypes, that does these operations as in-memory
| functional transforms over bag-of-tuples -typed variables.
|
| And then, once they have that solid understanding, I'd show
| them, in order:
|
| 1. that there exist these systems called RDBMSes that maintain
| _durable, mutable_ bags-of-tuples they call _tables_ (which is
| weird when you think about it; coming from a math world, a
| relation _should_ be an immutable mathematical object, a value-
| object defined by its members. But RDBMS tables _aren 't_
| relations, not exactly -- and this is where you can go into
| detail about SQL `NULL`, "nullable columns", and how in an
| RDBMS, the absence of an asserted row-tuple in a table doesn't
| necessarily mean the implicit assertion of its logical negation
| as a logical predicate; it rather means "haven't been fed the
| truth-value of that relation yet." RDBMS query engines aren't
| relational-algebra _computers_ ; they're relational-algebra
| _provers_ -- and NULL is their "not proven"!);
|
| 2. that these systems accept a COBOL-like declarative syntax
| for either
|
| * referencing their tables (`TABLE foo`);
|
| * expressing relational algebra upon their tables (`SELECT ...
| FROM foo ...`);
|
| * defining arbitrary literal bags-of-tuples (e.g. `VALUES (1,
| 2)` -- bet you didn't know _that_ was toplevel-grammatical in
| SQL!);
|
| * expressing relational algebra _upon_ arbitrary literal bags-
| of-tuples (e.g. `SELECT a FROM (VALUES (1, 2)) AS x (a, b)`);
|
| * or any combinations of the above -- using UNION, INTERSECT,
| or introducing any of the above as a _common table expression_
| and then referencing it by name.
|
| 3. that these systems also accept COBOL-like _imperative_
| syntax -- one which really has not-much-at-all to do with
| relational algebra -- for modifying the durable, mutable bag-
| of-tuples asserted by a given table (INSERT, UPDATE, DELETE,
| TRUNCATE);
|
| 4. and that, completely unnecessarily, these systems _also_
| have a COBOL-like imperative syntax for modifying the
| definitions of the tables themselves (unnecessary because the
| data-definition schema could have just been, itself, a set of
| tables that you modify through DML statements.
| INFORMATION_SCHEMA exists, but it 's read-only in every RDBMS
| I've ever seen. Wacky, isn't it?)
| uticus wrote:
| > I'd teach them _relational algebra_...
|
| Agreed. Curious what materials you've found that really drive
| this home?
|
| Like 90% of googlesearch results simplify things to the point
| where this is missed.
| random3 wrote:
| So approximately a SQL standard (aggregation "tree-climb")
| combined with a non-standard (LIMIT) which prevents an error
| yield an unexpected behavior.
| gigatexal wrote:
| Please don't use this as an interview question. It is absolutely
| diabolical.
___________________________________________________________________
(page generated 2023-12-05 23:01 UTC)