hngopher.com

       [HN Gopher] Introduction to Window Functions in SQL
       ___________________________________________________________________
        
       Introduction to Window Functions in SQL
        
       Author : adilkhash
       Score  : 182 points
       Date   : 2021-01-06 11:02 UTC (11 hours ago)
        
 (HTM) web link (khashtamov.com)
 (TXT) w3m dump (khashtamov.com)
        
       | dmitryminkovsky wrote:
       | I've been working on a messaging app, and at the beginning I
       | started out with Elasticsearch. Now, ES is really a wonderful
       | tool and it got me pretty far, but its lack of window functions
       | made one particular requirement impossible that we all take for
       | granted in messengers: a listing of most-recently updated threads
       | represented by the most recent message on each thread, sorted
       | reverse chronologically. Here is the SQL that makes that happen:
       | SELECT         tm.*       FROM         (           SELECT
       | row_number() OVER (               PARTITION BY tm_1.thread_id
       | ORDER BY tm_1.delivered_at DESC             ) AS "row",
       | tm_1.*           FROM             thread_message tm_1         )
       | tm       WHERE         tm."row" = 1;
       | 
       | The inner query groups all messages by thread, orders them to
       | find the "most recent" message on a thread given my ordering
       | requirements, and then assigns a row number to each such message
       | such that I can pick the most recent message on each thread in
       | the outside query. I still have no idea how I would have done
       | this with Elasticsearch.
        
       | hospadar wrote:
       | Wow I love me a window function!
       | 
       | Favorite/weirdest thing I learned to do most recently with window
       | functions is using/abusing min() and max(). If there is no value
       | in a column in a particular window, min and max will give NULL.
       | 
       | We use this to do really interesting stuff (entirely in sql!)
       | like "partition usage into sessions where a session is defined as
       | a bunch of activity by a user where they did an action at least
       | once every 30 minutes"
        
         | marcosdumay wrote:
         | SQL could really use some standard aggregation function that
         | returns null for no rows, the value of a single row and error
         | if there is more than one row.
         | 
         | That kind of query happens all the time, and min/max behavior
         | isn't the safest one.
        
       | n4r9 wrote:
       | I like how the article teaches by explicit example. Very helpful
       | when it comes to data.
       | 
       | The "lag" and "lead" SQL functions are particularly useful when
       | analysing sequences of timestamped events (GPS fixes, task
       | completions etc). They allow you to easily return the delta of a
       | value (time/distance) between the current and previous row, which
       | is really useful if you want to compare expected vs actual on
       | those deltas.
        
       | forinti wrote:
       | A few cool tricks I use with window functions:
       | 
       | 1- To find blocks of contiguous values, you can use something
       | similar to Gauss' trick for calculating arithmetic progressions:
       | sort them by descending order and add each value to the row
       | number. All contiguous values will add to the same number. You
       | can then apply max/min and get rows that correspond to the blocks
       | of values.                   select min(n), max(n) from (
       | select n, n+row_number() over (order by n desc) group
       | from numbers         )         group by group         order by 1
       | 
       | 2- You can use a window function with exponential/logarithms in
       | order to calculate the accumulated inflation for the last n
       | months:                   select date, inflation,
       | (exp(accumulated)-1)*100 from (           select date, inflation,
       | sum(ln(1+(inflation/100))) over (order by date desc rows between
       | current row and 11 following) as accumulated           from
       | inflation         )
       | 
       | 3- You can do all the paging in SQL (fetch page n of m) or simply
       | add a column with the total number of rows (this often makes it
       | easier to process the results).
        
         | pascalmahe wrote:
         | Could you please develop how to do paging with window
         | functions?
         | 
         | It's been some time since I've done serious work with SQL yet I
         | remember that all paging solutions I've found (eg. top results
         | on SO) are always platform-specific so having a platform-
         | independent way of doing it would mean I could finally try to
         | remember it.
        
           | forinti wrote:
           | Try this:                   select ord, table_name,
           | first_name, last_name, total_rows          from (
           | select table_name,                    row_number() over
           | (order by table_name asc) ord,                    count(1)
           | over () total_rows,             from all_tables
           | where table_name like 'B%'             order by table_name
           | desc         ) where  ord between 10 and 19         order by
           | ord
           | 
           | This is basically what Oracle Apex does. You can do this
           | without window functions, but it gets a bit more complicated.
           | 
           | I like to add total_rows so that I can show a "rows 10 - 19
           | of 81" header. You can also get the very first and last
           | values by adding theses columns:
           | first_value(table_name) over (order by table_name asc rows
           | between unbounded preceding and unbounded following)
           | first_name,         last_value(table_name) over (order by
           | table_name asc rows between unbounded preceding and unbounded
           | following) last_name
           | 
           | Alternatively, you could get the very last and very first
           | whole rows by changing the outer where clause:
           | ord=1 or rownum=1 or ord between 10 and 19
           | 
           | And then it would be easy to have a nice header with "rows 10
           | - 19 of 81 (Algiers to Zimbabwe)" with very little code.
        
           | bmn__ wrote:
           | If you mean pagination, then window functions are not the
           | right tool. Keyset pagination in standard SQL:
           | 
           | * https://use-the-index-luke.com/sql/partial-results/fetch-
           | nex...
           | 
           | * https://use-the-index-luke.com/no-offset
        
             | e12e wrote:
             | Both of these are bad for pagination - if the dataset isn't
             | read-only. If the search criteria matches 100 rows at the
             | time of the request, you may want to page through _those_
             | matches, even if, by the time the client (human or machine)
             | gets to page three, the query matches 80 or 110 rows - or
             | worse, if the query still matches 100 rows, but not all of
             | them are the same as the original 100!
             | 
             | You would normally capture such state by using cursors.
        
               | fuy wrote:
               | Could you elaborate on "using cursors" part? Are you
               | talking about database-level cursors? If so, do you know
               | any resources which cover this approach?
        
           | hospadar wrote:
           | row_number() !
           | 
           | As in: SELECT * FROM (SELECT *, row_number() OVER (order by
           | <some column - primary key column would be a good choice>) as
           | rowidx FROM your_table) as numbered WHERE rowidx > ? AND
           | rowid < ?
        
       | stelliosk wrote:
       | Analytic functions as they are known by in Oracle, SQL have been
       | around for more than a decade. They are useful for calculating
       | running balances, getting a previous or next row value, doing a
       | "group by" for a subset of columns.. etc. They're the best thing
       | in SQL "since sliced bread".
       | 
       | https://towardsdatascience.com/analytical-functions-in-oracl...
        
       | twic wrote:
       | What i like most about window functions is that they give me a
       | way to do a sort of 'extended group by' which i have always
       | wanted.
       | 
       | If you want to know the highest salary in each department, that's
       | easy:                 select department, max(gross_salary)
       | from salary       group by department
       | 
       | If you want to know who it is who earns that salary, you might
       | try to do this:                 select department, first_name,
       | max(gross_salary)       from salary       group by department
       | 
       | But this doesn't work, because it's meaningless to ask for
       | first_name in a situation where you're grouping by department.
       | You could ask for an aggregation of all names, but there's no
       | straightforward way to ask for the name of the person who earned
       | that salary. You end up having to write a join against the group
       | by, as in the article, which is pretty grim, and falls apart if
       | you want to order by multiple columns to break ties.
       | 
       | Window functions let you re-frame this kind of group by like
       | this:                 select department, gross_salary       from
       | (         select *, row_number() over (partition by department
       | order by gross_salary desc) as n         from salary       ) _
       | where n = 1
       | 
       | Because the outer query is no longer a group by, you can select
       | any columns you like. The natural query works fine:
       | select department, first_name, gross_salary       from (
       | select *, row_number() over (partition by department order by
       | gross_salary desc) as n         from salary       ) _       where
       | n = 1
       | 
       | This only works where the group by is based on an aggregate
       | function that picks one value, like min or max. I somewhat think
       | it was a mistake to model that kind of thing as an aggregation in
       | the first place. If SQL had a way of picking one row from a
       | group, rather than aggregating over it, that would be immensely
       | useful.
        
         | radiospiel wrote:
         | > If SQL had a way of picking one row from a group, rather than
         | aggregating over it, that would be immensely useful.
         | 
         | Well, there is a way which is window functions :) as shown by
         | you.
         | 
         | The idea to expect exactly one first name of the person with
         | the biggest salary is kinda wrong, since there can be more than
         | one person, and this can obviously not described as a single
         | column per group.
         | 
         | Note that aggregating is not limited to min, max, sum, etc.
         | Postgres, for example, has array_agg which aggregates
         | individual columns from each record of a group into an array,
         | if that becomes necessary.
        
           | twic wrote:
           | > The idea to expect exactly one first name of the person
           | with the biggest salary is kinda wrong, since there can be
           | more than one person, and this can obviously not described as
           | a single column per group.
           | 
           | That's true. For this idea to work, there would need to be
           | some framework for picking exactly one row from a group. The
           | simplest thing would be to raise an error if there were
           | multiple candidates, but perhaps there are better ways. I
           | think the operation of picking exactly zero or one things
           | from a group is so common that it's worth making provision
           | for it.
        
         | frankmcsherry wrote:
         | > If SQL had a way of picking one row from a group, rather than
         | aggregating over it, that would be immensely useful.
         | 
         | You can do this with a LATERAL join, if you want to avoid the
         | jankiness of window functions. Lateral joins are just a
         | programmatic way to introduce a correlated subquery. For
         | example                   SELECT department, first_name,
         | gross_salary FROM             (SELECT DISTINCT department FROM
         | salary) depts,             LATERAL (                 SELECT
         | first_name, gross_salary                  FROM salary
         | WHERE department = depts.department                  ORDER BY
         | gross_salary DESC                 LIMIT 3             )
         | 
         | This uses a limit of 3 to show off top-3 instead of just
         | argmax, but you could clearly set that to one if you wanted.
         | This construction can be pretty handy if you need the per-group
         | rows to be something other than what a window function could
         | handle.
        
       | davidhyde wrote:
       | I find window functions to be an excellent way to find the max
       | version of a set of things. The trick is to partition by some
       | columns (similar to how you would use a group by), order by
       | descending on your version number field, and use the row_number()
       | function which is very lightweight. Then you filter for all
       | entries where rownumber = 1 and voila you have the max version
       | without having to link back on yourself!
        
         | motogpjimbo wrote:
         | Sadly, I've worked on a lot of codebases where even the "link
         | back on yourself" trick was unknown to the developers,
         | generally leading to devs dumping the entire dataset into an
         | array and then manually iterating over it to find the max
         | values. I wish developers would learn more SQL at the start of
         | their careers...
        
         | aaronharnly wrote:
         | Yes! I use this daily.
        
       | llampx wrote:
       | Question for the pros: In doing some data engineering work, I
       | found that creating temporary tables and dropping them after the
       | run was much more performant and memory-efficient than using
       | CTEs. No other change was made to the queries in the CTE, just
       | putting them in a separate CREATE TABLE AS... script before the
       | part that needed the calculations.
       | 
       | Why is this the case? Shouldn't CTEs be more efficient?
        
         | jsmith45 wrote:
         | You already got a good response for postgres.
         | 
         | If you where speaking of MSSQL, it does treats non-recursive[1]
         | CTE's as equivalent to to a view, and effectively inline it. So
         | if things were faster with a temp table than a CTE it would
         | mean that the query optimizer came up with a bad query plan.
         | The most likely reason being that your query was complicated
         | enough that you confused the cardinality estimator, so it ends
         | up choosing really bad join types.
         | 
         | [1] Recursive queries in MSSQL are more complicated. They
         | spool, but not necessarily all rows get spooled right away. So
         | an infinite result recursive CTE does not necessarily fail. In
         | any case predicate pushdown still occurs.
        
         | eatonphil wrote:
         | Until postgres 12 (if you were using postgres) CTEs were an
         | optimization fence. Where filters would not get pushed down
         | into the CTE if they were only specified outside the CTE (but
         | in fields from the CTE).
         | 
         | https://paquier.xyz/postgresql-2/postgres-12-with-materializ...
        
           | deepsun wrote:
           | Well, filters won't get pushed down to previous CREATE TABLE
           | either.
        
         | MSM wrote:
         | Really depends on query complexity. Keep in mind that creating
         | a CTE doesn't really materialize any of the resulting steps'
         | data, it's just a way to encapsulate logic. Chaining together
         | complex CTEs is effectively the same as creating layers of
         | views, which can also be a performance problem.
         | 
         | In the case of a temp table, you're likely intelligently
         | (manually) going through the process of whittling down the data
         | to only what you need and in the most efficient ways possible.
         | Maybe you're filtering by date upfront, then by some other
         | filter, removing as many rows as is possible as quickly as
         | possible. With the CTE(s), it might be creating a list of many
         | filters with a lot of joins in one step, and the optimizer may
         | not understand which filters can remove the most rows the most
         | easily.
        
           | deepsun wrote:
           | > creating a CTE doesn't really materialize any of the
           | resulting steps' data
           | 
           | Well, at least with PostgreSQL -- all CTEs were always
           | materialized. It was optimized only in the version 12.
        
       | throw_m239339 wrote:
       | Introduction to Window Functions with the same examples (salaries
       | per department)
       | 
       | https://www.postgresql.org/docs/13/tutorial-window.html
       | 
       | While you are at it, check out common table expressions as well
       | 
       | https://www.postgresql.org/docs/13/queries-with.html
        
         | adilkhash wrote:
         | okay, thanks :)
        
       | Strs2FillMyDrms wrote:
       | I've been encountering a lot of issues, when doing dedicated
       | queries, specially when a team is small, I've found is sometimes
       | more useful to just do the grouping or left join (if more than 2
       | are required) manually.
       | 
       | The problem is ofc scalability, if your team is small it is
       | better to just have some few, but very specific queries, and do
       | whatever transactions required on a layer above.
       | 
       | It may be the difference between correcting a few lines, vs
       | reworking 20+ different queries, and testing each one
       | individually.
        
       | fronofro wrote:
       | I made gifs to explain the different types of window functions:
       | https://dataschool.com/how-to-teach-people-sql/how-window-fu...
        
       | kartaevt wrote:
       | Finally, comprehensive article about Windows Functions.
        
         | resynth1943 wrote:
         | I thought the same. Assumed it was some Microsoft BS.
         | Thankfully it's not :D
        
       | extrememacaroni wrote:
       | SQL is only secondary in my work but I'd rather go with the
       | explicit and verbose queries rather than with stuff with
       | surprises like how the last_value gives wrong results when used
       | the way you'd expect i.e. the reverse of first_value.
        
         | adilkhash wrote:
         | if you understand the window ranges, especially the default one
         | there is no surprise at all
        
       | llampx wrote:
       | Previous discussion on SQL Window functions:
       | https://news.ycombinator.com/item?id=20872114
       | 
       | Helpful resource: http://www.windowfunctions.com/
        
         | adilkhash wrote:
         | i would also recommend the wonderful https://modern-sql.com/
        
       | morrbo wrote:
       | I've always wondered and perhaps someone here might know....do
       | postgres' CTEs translate into big select/sub select queries under
       | the hood? Or are they something special entirely? Ie (forgive
       | formatting as I'm on phone) does:
       | 
       | With mything as ( Select * from table where... ),
       | 
       | Myotherthing as ( Select * from mything where... )
       | 
       | Get translated to
       | 
       | Select * from (select * from ( select * from...)...)...)
       | 
       | So I'm just wondering are CTEs just easier to read, or do they
       | offer any other known optimizations? We use them loads but mainly
       | just for keeping larger queries easier to manage
        
         | virgilp wrote:
         | In theory, both forms should get optimized similarly by the DB,
         | but the practice will likely differ from database to database,
         | and maybe even from DB version to DB version.
         | 
         | There are though things that CTEs can do and sub-selects can't
         | (e.g. WITH RECURSIVE)
        
           | morrbo wrote:
           | Didn't know about WITH RECURSIVE, very cool - thanks
        
         | avianlyric wrote:
         | Not 100% about the newest versions of Postgres. But certainly
         | in older versions CTEs created query planner boundaries.
         | 
         | So the planner would optimise each CTE separately, but wouldn't
         | optimise the entire query with all CTEs together, which of
         | course can result in some slightly nonsensical query plans.
         | 
         | To my knowledge this is considered a limitation rather than a
         | feature as it can cause performance issues with some queries.
         | 
         | EDIT: Some quick Googling suggests the CTEs are no longer
         | optimisation fences, and their handling can be tweaked on a CTE
         | by CTE basic in Postgres 12 onwards [1]
         | 
         | [1] https://www.depesz.com/2019/02/19/waiting-for-
         | postgresql-12-...
        
           | tda wrote:
           | In postgresql 12 a patch [0] is merged to remove the
           | optimization fence certain conditions. From the changelog [1]
           | 
           | " Allow common table expressions (CTEs) to be inlined into
           | the outer query (Andreas Karlsson, Andrew Gierth, David
           | Fetter, Tom Lane)
           | 
           | Specifically, CTEs are automatically inlined if they have no
           | side-effects, are not recursive, and are referenced only once
           | in the query. Inlining can be prevented by specifying
           | MATERIALIZED, or forced for multiply-referenced CTEs by
           | specifying NOT MATERIALIZED. Previously, CTEs were never
           | inlined and were always evaluated before the rest of the
           | query. "
           | 
           | [0] https://git.postgresql.org/gitweb/?p=postgresql.git;a=com
           | mit... [1] https://www.postgresql.org/docs/12/release-12.html
        
         | isbvhodnvemrwvn wrote:
         | It's a bit hidden (or rather not pronounced enough) in the
         | general docs but there are performance implications of using
         | CTEs:
         | 
         | https://www.postgresql.org/docs/13/queries-with.html
         | 
         | > A useful property of WITH queries is that they are normally
         | evaluated only once per execution of the parent query, even if
         | they are referred to more than once by the parent query or
         | sibling WITH queries. Thus, expensive calculations that are
         | needed in multiple places can be placed within a WITH query to
         | avoid redundant work. Another possible application is to
         | prevent unwanted multiple evaluations of functions with side-
         | effects. However, the other side of this coin is that the
         | optimizer is not able to push restrictions from the parent
         | query down into a multiply-referenced WITH query, since that
         | might affect all uses of the WITH query's output when it should
         | affect only one. The multiply-referenced WITH query will be
         | evaluated as written, without suppression of rows that the
         | parent query might discard afterwards. (But, as mentioned
         | above, evaluation might stop early if the reference(s) to the
         | query demand only a limited number of rows.)
         | 
         | > However, if a WITH query is non-recursive and side-effect-
         | free (that is, it is a SELECT containing no volatile functions)
         | then it can be folded into the parent query, allowing joint
         | optimization of the two query levels. By default, this happens
         | if the parent query references the WITH query just once, but
         | not if it references the WITH query more than once. You can
         | override that decision by specifying MATERIALIZED to force
         | separate calculation of the WITH query, or by specifying NOT
         | MATERIALIZED to force it to be merged into the parent query.
         | The latter choice risks duplicate computation of the WITH
         | query, but it can still give a net savings if each usage of the
         | WITH query needs only a small part of the WITH query's full
         | output.
         | 
         | There are some examples below this text as well.
        
           | morrbo wrote:
           | thanks very much for the reply, it was a case of RTFM on my
           | end!
        
       ___________________________________________________________________
       (page generated 2021-01-06 23:02 UTC)