https://khashtamov.com/en/sql-window-functions/ Toggle navigation * Blog * Notes * Archives * Contacts * Na russkom Introduction to Window Functions in SQL 3 Jan. 2020, Data engineering, 23494 views It is interesting that many people working with data have no clue about window functions in SQL. During the long period of time instead of using window functions I prefered coding on Python and pandas. Today I would like to introduce you to the concept of "window" and how it is related to data extraction from a SQL database. Window functions are applied to a subset/window of rows related to one another. In comparison to GROUP BY operation, window functions do not decrease the number of rows. Aggregate functions like AVG, SUM, COUNT could be used as window functions as well. Usually window functions are used to do analytical tasks. The following examples of queries will be performed on PostgreSQL database. Syntax () OVER ( -- optional ) Looks creepy We need more practice. Let's assume that we have a salary table: [salary_table_1] One day your boss approaches you, he wants to know who is the highest paid employee by department. We can use the MAX function to select the highest salary by each department. SELECT department, MAX(gross_salary) as max_salary FROM Salary GROUP BY 1; The result is: [INS::INS] [department_max_salary] In order to select the person with the highest salary by department we can use a subquery and JOIN: SELECT id, first_name, department, t.gross_salary FROM Salary JOIN ( SELECT department, MAX(gross_salary) as gross_salary FROM Salary GROUP BY 1 ) t USING(gross_salary, department); The result is: [highest_paid_employees] But the query looks verbose and dirty, maybe we can do the same with window functions? Sure. All you need to do is to set up a window for MAX aggregate function: SELECT id, first_name, department, gross_salary, MAX(gross_salary) OVER (PARTITION BY department) as max_gross_salary FROM Salary; The column department is a window. It means that employees from the same department is a subset or window of related rows. The result of the query: [window_functions_highest_salary] As you can see we have the new column called max_gross_salary which shows the highest salary in each department. In order to retrieve the list of most paid employees we do the following: SELECT * FROM ( SELECT id, first_name, department, gross_salary, MAX(gross_salary) OVER (PARTITION BY department) as max_gross_salary FROM Salary ) t WHERE max_gross_salary = gross_salary ORDER BY id; Now your boss asks you to prepare the report which shows the ratio of salaries by department and by the total wage fund because some departments have relatively low-income employees in comparison with other departments. You can solve it using subqueries: WITH gross_by_departments AS ( SELECT department, SUM(gross_salary) as dep_gross_salary FROM Salary GROUP BY 1 ) SELECT id, first_name, department, gross_salary, ROUND(CAST(gross_salary AS numeric(9, 2)) / dep_gross_salary * 100, 2) as dep_ratio, ROUND(CAST(gross_salary AS numeric(9, 2)) / (SELECT SUM(gross_salary) FROM Salary) * 100, 2) as total_ratio FROM Salary JOIN gross_by_departments USING(department) ORDER BY department, dep_ratio DESC The result is: [subqueries_salary_ratios] As you can see Nina earns 71.4% out of HR department, but it is only 10.7% out of total wage fund. Arkadii on the other hand has 21.4% out of total wage fund and 41% out of IT department. Can we refactor the query to make it smaller and more readable? Yes! Window functions! SELECT id, first_name, department, gross_salary, ROUND(CAST(gross_salary AS numeric(9,2)) / SUM(gross_salary) OVER (PARTITION BY department) * 100, 2) as dep_ratio, ROUND(CAST(gross_salary AS numeric(9,2)) / SUM(gross_salary) OVER () * 100, 2) as total_ratio FROM Salary ORDER BY department, dep_ratio DESC; The expression OVER() means that the "window" is all rows. Window functions only We used the aggregate functions MAX & SUM as window functions above. But SQL standard consists of functions which cannot be used as aggregates hence no way to apply them during grouping. Here is the list of them: * first_value * last_value * lead * lag * rank * dense_rank * row_number The comprehensive list of all window functions supported in PostgreSQL you can find here. Using window functions Let's use the first_value function in order to solve the very first problem where we were asked to get the name of the highest paid employee by each department. The function returns the very first values according to the provided window. SELECT id, first_name, department, gross_salary, first_value(first_name) OVER (PARTITION BY department ORDER BY gross_salary DESC ) as highest_paid_employee FROM Salary We are sorting the rows in our window by salary in a descending order, it means that the very first row will be a person with the highest salary. Now let's use last_value. It does the opposite of what first_value does. SELECT id, first_name, department, gross_salary, last_value(first_name) OVER (PARTITION BY department ORDER BY gross_salary DESC) AS lowest_paid_employee FROM Salary The result is: [lowest_paid_employees_last_value_window_function] If you look at the table you will see that the result is wrong. Why? Because of window ranges. By default if you provide the ORDER BY clause the window range is all preceding rows and current row, in SQL terms it is ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. It means that the last_value for a particular column in a particular row is a column in the same row. In order to fix this problem we need to change the range to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: SELECT id, first_name, department, gross_salary, last_value(first_name) OVER ( PARTITION BY department ORDER BY gross_salary DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) as lowest_paid_employee FROM Salary This is how it looks visually: [unbounded_] The possible values: * N PRECEDING, N number of rows until current row * CURRENT ROW * UNBOUNDED PRECEDING * UNBOUNDED FOLLOWING * N FOLLOWING**, N number of rows following current row Related posts: * Introduction to Apache Airflow Subscription [ ] Subscribe Search [ ] Find Recent posts * Introduction to Apache Airflow * How to Work with PostgreSQL in Python * Introduction to Window Functions in SQL * What is new in Python 3.8 * Introduction to pandas: data analytics in Python * How To Deploy a Telegram Bot * How To Create a Telegram Bot Using Python * Celery Best Practices: practical approach Ad [INS::INS] Please enable JavaScript to view the comments powered by Disqus. Powered by Django 3.1.5 & Python 3.6.1 @ DigitalOcean (c) 2015 -- 2021 [32700075]