[HN Gopher] The curious case of the aggregation query
       ___________________________________________________________________
        
       The curious case of the aggregation query
        
       Author : eatonphil
       Score  : 70 points
       Date   : 2023-12-05 14:31 UTC (8 hours ago)
        
 (HTM) web link (modern-sql.com)
 (TXT) w3m dump (modern-sql.com)
        
       | crazygringo wrote:
       | > _Now the mysterious query: SELECT (SELECT sum(a) FROM xx LIMIT
       | 1) FROM aa_
       | 
       | To be clear, that's just an absolutely pathological query, since
       | column "a" doesn't exist in table "xx" -- it only exists in table
       | "aa".
       | 
       | You _can_ use values from outer queries inside of inner queries,
       | which are called  "correlated subqueries". Which is _not_
       | recommended unless you 're already a real SQL expert, because
       | they're so confusing (and their performance is also pretty
       | absymal).
       | 
       | But this isn't even that -- this is an _aggregate function_
       | across an outer variable. I consider myself a strong SQL expert,
       | but looking at this query, it just feels nonsensical. Not only
       | could I not tell you what it would produce -- I couldn 't even
       | guess at what its author intended it to mean in the first place.
       | 
       | > _Later in the article we will see that SQL implementations
       | don't even agree about the correct result._
       | 
       | I'm not surprised at all. I'm actually more surprised they
       | produce results _at all_ , rather than just errors about it being
       | an invalid query.
        
         | MarkusWinand wrote:
         | No question, such a query should not be written. That's
         | probably the reason why this odd behavior, which is even
         | different in various DBMSes, is not causing everyday problems.
        
           | marcosdumay wrote:
           | The reason the behavior doesn't cause problems is because
           | everybody treats automatic aggregation as a voodoo where they
           | know one recipe that works and anything different is domain
           | of the Devil.
           | 
           | And, IMO, that's a very sane and reasonable way to treat it.
           | The entire idea of automatic aggregation is flawed, and those
           | queries should just have a `group by ()` explicit at the
           | right place.
        
             | foldU wrote:
             | I agree, I think the original sin here is the fact that
             | whether a `SELECT` is an aggregation is determined by the
             | contents of the scalar expressions at all. I think most of
             | this weirdness comes directly out of wanting to be able to
             | write both `SELECT sum(x) FROM xx` and `SELECT x FROM xx`
             | and have them work.
             | 
             | Not that I have a better solution offhand, in SQL grouping
             | by a constant value is not actually the same as not writing
             | `GROUP BY` at all since the behaviour on empty tables is
             | different.
        
               | marcosdumay wrote:
               | Not writing `group by` is the same as writing `group by
               | ()`.
               | 
               | And yeah, the difference between that and a value is one
               | of those really surprising things on SQL that actually
               | make sense and should be this way. Unfortunately, there
               | are many of those.
        
               | foldU wrote:
               | Ah my apologies, I wasn't familiar with that syntax.
        
               | marcosdumay wrote:
               | No need to apologize. My post didn't make that clear at
               | all.
        
               | derefr wrote:
               | What's an aggregation per se? A SQL query is best thought
               | of an arbitrary generator function. You can, using
               | functions like UNNEST, end up emitting _multiple_ row-
               | tuples for each processed input row-tuple. An aggregation
               | is just a generator function that happens to reduce over
               | all the input rows and then emit one row-tuple when there
               | 's no more input. Query-planning engines do not special-
               | case this. It's just a generator node like any other
               | generator node.
               | 
               | Consider: using window functions, you can do _partial_
               | aggregations over subsets of the input -- without even
               | necessarily partitioning the input (i.e. you can compute
               | "running totals" and other wacky state-machine-like
               | outputs.)
        
             | indigo945 wrote:
             | Would you argue that automatic scalar-ism is also flawed,
             | and the query                   SELECT a FROM aa;
             | 
             | should have an explicit grouping, like
             | SELECT a FROM aa GROUP BY id;
             | 
             | ?
             | 
             | After all, when you think about it, it's really not the
             | aggregate functions that break expectations, it's the
             | scalars. Of the four combinations of having or not having
             | an explicit `GROUP BY` and having or not having an
             | aggregate function, three of them have the unsurprising
             | behavior of returning exactly one row per grouping.
             | -- aggregate function and group by - one single result per
             | grouping         SELECT sum(a) FROM aa GROUP BY a; --or
             | SELECT sum(a) FROM aa GROUP BY ();                   -- no
             | aggregate function, but a group by - one single result per
             | grouping         SELECT a FROM aa GROUP BY a;
             | -- just aggregate function, but no group by - one single
             | result per (single, implicit) grouping         SELECT
             | sum(a) FROM aa;
             | 
             | But then, when you have neither an aggregate function nor
             | an explicit `GROUP BY`, it breaks expectations:
             | -- no aggregate function, no group by - one result row per
             | row          -- in the source set, even though there should
             | be only one big implicit group         SELECT a FROM aa;
             | 
             | Therefore, I propose that the next SQL standard should
             | introduce a new `GROUP BY INDIVIDUAL ROW` keyword, that
             | henceforth all "scalar" queries MUST use in order to have
             | consistent behaviour with the rest of the language.
        
               | marcosdumay wrote:
               | What is flawed is not that there is an implicit grouping
               | on all queries. It's that the implicit grouping changes,
               | depending on a set of rules that consider stuff written
               | in several places that are not explicitly related to it.
               | 
               | You are asking if it makes sense to have an implicit
               | grouping at all; it very obviously absolutely does. And
               | grouping by individual row is the very obvious default.
               | But I do totally support adding that keyword expressing
               | the default. All defaults should be expressible.
        
         | indigo945 wrote:
         | I don't find it that confusing, to be honest. When the article
         | presented the "mysterious" query, I guessed the result
         | correctly. A correlated subquery will always execute the entire
         | expression for each row of the outer query. Since the value of
         | `sum(a)` is known in the context of the outer query, it doesn't
         | really matter where you select it from in the inner query.
         | Therefore, the following queries are all identical:
         | SELECT sum(a) FROM aa;         SELECT (SELECT sum(a) FROM xx
         | LIMIT 1) FROM aa;         SELECT (SELECT sum(a) FROM
         | generate_series(1, 1)) FROM aa;
         | 
         | To maybe explain why this behaviour makes sense, stop thinking
         | about aggregate queries for a second, and just consider normal
         | expressions. What would you expect the result of the following
         | query to be?                   SELECT (SELECT a + x FROM xx
         | LIMIT 1) FROM aa;
         | 
         | This is obviously a bad query because it uses `LIMIT` without
         | `ORDER BY`, but let's just put that aside. This query should
         | make some sense to pretty much any SQL developer: you take each
         | `a` value from `aa`, and then add the value from the "first"
         | (whatever that means) `x` value from `xx` to it. (In my test on
         | Postgres, this returns [11, 12, 13], but there's obviously no
         | guarantee because we neglected to sort `xx`.)
         | 
         | Now, since that was easy, consider this simplification of that
         | query:                   SELECT (SELECT a FROM xx LIMIT 1) FROM
         | aa;
         | 
         | I think that if the previous query made sense, this one should
         | also make sense, as the expression simple has one less term.
         | The result is obviously [1, 2, 3].
         | 
         | So, what should the aggregate return:                   SELECT
         | (SELECT sum(a) FROM xx LIMIT 1) FROM aa;
         | 
         | The only real difference is that the outer query is now
         | aggregated, and therefore it is implicitly in one big grouping,
         | and therefore returns only one row. So the correlated subquery
         | gets run once, for that one row - what's the value for `sum(a)`
         | in that row, the one row that the outer query returns? Well, 6.
         | Duh.
        
           | foldU wrote:
           | (I'm the author of the original post [1]) And while I think
           | the rules, in the end, make sense, I think it's not quite as
           | clear cut as you describe, consider:                   SELECT
           | (SELECT sum(1) FROM xx LIMIT 1) FROM aa;
           | 
           | which returns                   3         3         3
           | 
           | Personally, I think having the inner aggregation always
           | attach to the nearest `SELECT` would have been an equally
           | valid way of defining how this works, but it just so happens
           | it is not defined that way.
           | 
           | [1]: https://buttondown.email/jaffray/archive/sql-scoping-is-
           | surp...
        
             | indigo945 wrote:
             | Well, the "fun" one to consider is IMO not that, but this:
             | SELECT (SELECT sum(a + x) FROM xx LIMIT 1) FROM aa;
             | 
             | Making a pretty diagram where all the behaviours from
             | various DBMSs are listed is left as an exercise to the
             | author. :)
        
               | foldU wrote:
               | I think this one is obvious actually! There's no choice
               | but to aggregate it at the level of the `xx` `SELECT`, no
               | other level has access to the `x` column.
        
               | MarkusWinand wrote:
               | > Making a pretty diagram where all the behaviours from
               | various DBMSs are listed is left as an exercise to the
               | author. :)
               | 
               | It's there already!
               | 
               | It's in the chart as footnote "b": Outer reference must
               | be the only argument (doesn't support F441)
               | 
               | That's the one thing SQL Server doesn't eat. Those that
               | are green in the chart work fine in this case.
        
           | tsimionescu wrote:
           | You say it's obvious, but it apparently wasn't obvious to the
           | developers of Oracle DB, one of the most used databases.
           | 
           | I think the most unexpected thing is that the query engine
           | looks into the inner queries to decide whether to return all
           | rows of the outer table or some aggregation of them. If there
           | had been a group by clause, it would have been much clearer
           | that an aggregation must be happening.
           | 
           | Edit: also, the "obvious" logic actually changes completely
           | if you look at the related query, SELECT (SELECT sum(a+x)
           | FROM xx LIMIT 1) FROM aa;
           | 
           | This returns 63, 66, 69. Similarly, sum(1) will return 3, 3,
           | 3.
           | 
           | So, we could very well expect that sum (a) would work like
           | that - it would treat each value of `a` as a constant that it
           | aggregates over all rows of xx, so it would return 3, 6, 9
           | (the limit is anyway ignored if you are aggregating).
        
             | jhoechtl wrote:
             | > Oracle DB, one of the most used databases.
             | 
             | Is it still? Maybe 20 years ago but it's been a long time
             | that I met an It guy who talked about himself having worked
             | with an Oracle DB recently.
             | 
             | It's all PostgreSQL, MS SQL and MySQL/MariaDB where I work.
        
           | da_chicken wrote:
           | > I don't find it that confusing, to be honest.
           | 
           | I would say it's not all that complicated what is going on. I
           | would say that you can figure it out if you have experience
           | reading queries.
           | 
           | I would _never_ say it 's not confusing. Just because the
           | query engine can parse and generate an execution plan and you
           | can understand how and why it's doing what it decides to do
           | doesn't mean it's not confusing. "Confusing" doesn't mean
           | "incomprehensible." It is confusing. It's confusing in almost
           | the same way that "6/2(1+2) = ?" viral equation is confusing
           | [0] or the Fast Inverse Square Root is confusing. This query
           | is _unconventional_.
           | 
           | Not "unconventional" in the sense that it's uncommon
           | (although it is also that). It's unconventional in the sense
           | that _it goes against established conventions_. Humans don 't
           | write queries that way. This query violates the collective
           | query conventions of the SQL community. People don't write
           | queries like this or teach writing queries like this because
           | it's not a particularly useful or portable pattern to put the
           | aggregate value of the outside query in the SELECT clause of
           | an inside correlated subquery. There are better ways to write
           | this query. Because of that, it's not clear what the author's
           | _intent_ was or why they would choose to write it thus
           | instead of, well, nearly any other way that a human would
           | actually choose beforehand. It 's so unconventional that the
           | intent is uncertain, and "uncertainty of intent" is another
           | way to say "confusing."
           | 
           | Code is not valuable because it executes. Code is valuable
           | because the programmer, the computer, and any future
           | programmer agree on the intent with no uncertainty. That's
           | why over the long term clarity is often more potent than
           | cleverness.
           | 
           | [0]: https://youtu.be/4x-BcYCiKCk
        
         | nimish wrote:
         | Correlated subqueries are absolutely fine and are equivalent to
         | (possibly lateral) joins. They aren't as easy to identify and
         | optimize as such, so I wouldn't necessarily recommend them, but
         | sometimes they make sense. Not an issue with a good query
         | planner and if they clarify a query.
        
           | da_chicken wrote:
           | Yeah, correlated subqueries are fine. Unless you're on MySQL,
           | which (historically, at least) was infamously terrible at
           | executing them.
           | 
           | The reason this query works is because correlated
           | subqueries... need something to correlate with! Usually the
           | column from the outside query is only referenced in the WHERE
           | clause, but there isn't really a reason you _can 't_
           | reference it in the inside query's SELECT clause. After all,
           | what if you needed it for a CASE expression there?
           | 
           | There are certainly _better_ ways to write what this query is
           | accomplishing -- at the very least there are ways to write it
           | that make it much more clear what your intentions are -- but
           | just because it 's weird doesn't mean it's invalid. The
           | developers know they cannot anticipate all possible
           | operations that might be desired, and there are already
           | countless ways to write queries _incorrectly_.
        
           | HDThoreaun wrote:
           | As an employee at a database company working on the plan
           | optimizer I would like to ask you not to use correlated
           | subqueries for my own sanity. Close to 100% of the mind
           | blowing bugs are correlated subqueries. Kidding of course but
           | holy shit my life would be so much easier without them.
        
       | crdrost wrote:
       | For a while now, whenever I have taught SQL to people (which was
       | only I think 3 times over like 8 years) I ask them to imagine
       | that the word SELECT can be replaced with AGGREGATE which unlocks
       | these magical words GROUP BY and SUM and so forth, and they
       | understand me perfectly well: and then I drop on them that I was
       | lying and the magic word is still SELECT but you have to know
       | which kind of SELECT it is from context clues and they are like
       | "What???? Whyyyyyy."
        
         | derefr wrote:
         | To be honest, teaching SQL "syntax in" seems like a really
         | backwards way of getting an understanding of what SQL is doing.
         | 
         | I run the tech side of a company almost entirely built on
         | complex SQL queries. If I were to take a fresh college student
         | and "teach them SQL"... well, I _wouldn 't_ teach them SQL. I'd
         | teach them _relational algebra_ : projection, selection, join,
         | etc. as mathematical operators over bags of tuples. I'd make
         | sure that they could show me what the bag-of-tuples looks like
         | at each step, as each operator is applied. I'd get them to
         | write code in some programming language with first-class Set
         | and Tuple datatypes, that does these operations as in-memory
         | functional transforms over bag-of-tuples -typed variables.
         | 
         | And then, once they have that solid understanding, I'd show
         | them, in order:
         | 
         | 1. that there exist these systems called RDBMSes that maintain
         | _durable, mutable_ bags-of-tuples they call _tables_ (which is
         | weird when you think about it; coming from a math world, a
         | relation _should_ be an immutable mathematical object, a value-
         | object defined by its members. But RDBMS tables _aren 't_
         | relations, not exactly -- and this is where you can go into
         | detail about SQL `NULL`, "nullable columns", and how in an
         | RDBMS, the absence of an asserted row-tuple in a table doesn't
         | necessarily mean the implicit assertion of its logical negation
         | as a logical predicate; it rather means "haven't been fed the
         | truth-value of that relation yet." RDBMS query engines aren't
         | relational-algebra _computers_ ; they're relational-algebra
         | _provers_ -- and NULL is their  "not proven"!);
         | 
         | 2. that these systems accept a COBOL-like declarative syntax
         | for either
         | 
         | * referencing their tables (`TABLE foo`);
         | 
         | * expressing relational algebra upon their tables (`SELECT ...
         | FROM foo ...`);
         | 
         | * defining arbitrary literal bags-of-tuples (e.g. `VALUES (1,
         | 2)` -- bet you didn't know _that_ was toplevel-grammatical in
         | SQL!);
         | 
         | * expressing relational algebra _upon_ arbitrary literal bags-
         | of-tuples (e.g. `SELECT a FROM (VALUES (1, 2)) AS x (a, b)`);
         | 
         | * or any combinations of the above -- using UNION, INTERSECT,
         | or introducing any of the above as a _common table expression_
         | and then referencing it by name.
         | 
         | 3. that these systems also accept COBOL-like _imperative_
         | syntax -- one which really has not-much-at-all to do with
         | relational algebra -- for modifying the durable, mutable bag-
         | of-tuples asserted by a given table (INSERT, UPDATE, DELETE,
         | TRUNCATE);
         | 
         | 4. and that, completely unnecessarily, these systems _also_
         | have a COBOL-like imperative syntax for modifying the
         | definitions of the tables themselves (unnecessary because the
         | data-definition schema could have just been, itself, a set of
         | tables that you modify through DML statements.
         | INFORMATION_SCHEMA exists, but it 's read-only in every RDBMS
         | I've ever seen. Wacky, isn't it?)
        
           | uticus wrote:
           | > I'd teach them _relational algebra_...
           | 
           | Agreed. Curious what materials you've found that really drive
           | this home?
           | 
           | Like 90% of googlesearch results simplify things to the point
           | where this is missed.
        
       | random3 wrote:
       | So approximately a SQL standard (aggregation "tree-climb")
       | combined with a non-standard (LIMIT) which prevents an error
       | yield an unexpected behavior.
        
       | gigatexal wrote:
       | Please don't use this as an interview question. It is absolutely
       | diabolical.
        
       ___________________________________________________________________
       (page generated 2023-12-05 23:01 UTC)