https://hakibenita.com/the-many-faces-of-distinct-in-postgre-sql

Haki Benita

  * About
  * Subscribe

  * 11 May 2017
  * PostgreSQL, SQL

The Many Faces of DISTINCT in PostgreSQL

3 interesting uses of DISTINCT in PostgreSQL

---------------------------------------------------------------------

I started my programming career as an Oracle DBA. It took a few years
but eventually I got fed up with the corporate world and I went about
doing my own thing.

When I no longer had the comfy cushion of Oracle enterprise edition I
discovered PostgreSQL. After I gotten over not having proper
partitions and MERGE statement (aka UPSERT), I found some nice unique
features in PostgreSQL. Oddly enough, a lot of them contained the
word DISTINCT.

DISTINCT

I created a simple Employee table with name, department and salary
using mock data from this site:

haki=# \d employee

    Column   |         Type          | Modifiers
------------+-----------------------+-----------
    id         | integer               | not null
    name       | character varying(30) |
    department | character varying(30) |
    salary     | integer               |

haki=# select * from employee limit 5;

    id |      name      |      department      | salary
----+----------------+----------------------+--------
    1 | Carl Frazier   | Engineering          |   3052
    2 | Richard Fox    | Product Management   |  13449
    3 | Carolyn Carter | Engineering          |   8366
    4 | Benjamin Brown | Business Development |   7386
    5 | Diana Fisher   | Services             |  10419

What is DISTINCT?

    SELECT DISTINCT eliminates duplicate rows from the result.

The simplest use of distinct is, for example, to get a unique list of
departments:

haki=# SELECT DISTINCT department FROM employee;

        department
--------------------------
    Services
    Support
    Training
    Accounting
    Business Development
    Marketing
    Product Management
    Human Resources
    Engineering
    Sales
    Research and Development
    Legal

(easy CS students, I know it's not normalized...)

We can do the same thing with group by

SELECT department FROM employee GROUP BY department;

But we are talking about DISTINCT.

---------------------------------------------------------------------

DISTINCT ON

A classic job interview question is finding the employee with the
highest salary in each department.

This is what they teach in the university:

haki=# SELECT
    *
FROM
    employee
WHERE
    (department, salary) IN (
        SELECT
            department,
            MAX(salary)
        FROM
            employee
        GROUP BY
            department
    )
ORDER BY
    department;

 id |       name       |        department        | salary
----+------------------+--------------------------+--------
 30 | Sara Roberts     | Accounting               |  13845
  4 | Benjamin Brown   | Business Development     |   7386
  3 | Carolyn Carter   | Engineering              |   8366
 20 | Janet Hall       | Human Resources          |   2826
 14 | Chris Phillips   | Legal                    |   3706
 10 | James Cunningham | Legal                    |   3706
 11 | Richard Bradley  | Marketing                |  11272
  2 | Richard Fox      | Product Management       |  13449
 25 | Evelyn Rodriguez | Research and Development |  10628
 17 | Benjamin Carter  | Sales                    |   6197
 24 | Jessica Elliott  | Services                 |  14542
  7 | Bonnie Robertson | Support                  |  12674
  8 | Jean Bailey      | Training                 |  13230

Legal has two employees with the same high salary. Depending on the
use case, this query can get pretty nasty.

If you graduated a while back, you already know a few things about
databases and you heard about analytic and window functions, you
might do this:

WITH ranked_employees AS (
    SELECT
        ROW_NUMBER() OVER (
           PARTITION BY department ORDER BY salary DESC
        ) AS rn,
        *
    FROM
        employee
)
SELECT
    *
FROM
    ranked_employees
WHERE
    rn = 1
ORDER BY
    department;

The result is the same without the duplicates:

 rn | id |       name       |        department        | salary
----+----+------------------+--------------------------+--------
  1 | 30 | Sara Roberts     | Accounting               |  13845
  1 |  4 | Benjamin Brown   | Business Development     |   7386
  1 |  3 | Carolyn Carter   | Engineering              |   8366
  1 | 20 | Janet Hall       | Human Resources          |   2826
  1 | 14 | Chris Phillips   | Legal                    |   3706
  1 | 11 | Richard Bradley  | Marketing                |  11272
...

Up until now, this is what I would have done.

Now for the real treat, PostgreSQL has a special nonstandard clause
to find the first row in a group:

SELECT DISTINCT ON (department)
    *
FROM
    employee
ORDER BY
    department,
    salary DESC;

This is wild!This is wild!

This is wild! Why nobody told me this is possible?

The docs explain DISTINCT ON:

    SELECT DISTINCT ON ( expression [, ...] ) keeps only the first row
    of each set of rows where the given expressions evaluate to equal

And the reason I haven't heard about it is:

    Nonstandard Clauses
    DISTINCT ON ( ... ) is an extension of the SQL standard.

PostgreSQL does all the heavy lifting for us. The only requirement is
that we ORDER BY the field we group by (department in this case). It
also allows for "grouping" by more than one field which only makes
this clause even more powerful.

---------------------------------------------------------------------

IS DISTINCT FROM

Comparing values in SQL can result in three outcomes - true, false or
unknown:

WITH t AS (
    SELECT 1 AS a, 1 AS b UNION ALL
    SELECT 1, 2 UNION ALL
    SELECT NULL, 1 UNION ALL
    SELECT NULL, NULL
)
SELECT
    a,
    b,
    a = b as equal
FROM
    t;

  a   |  b   | equal
------+------+-------
    1 |    1 | t
    1 |    2 | f
 NULL |    1 | NULL
 NULL | NULL | NULL

The result of comparing NULL with NULL using equality (=) is UNKNOWN
(marked as NULL in the table).

In SQL 1 = 1 and NULL IS NULL but NULL != NULL.

It's important to be aware of this subtlety because comparing
nullable fields might yield unexpected results.

The full condition to get either true or false when comparing
nullable fields is:

(a is null and b is null)
or
(a is not null and b is not null and a = b)

And the result:

  a   |  b   | equal | full_condition
------+------+-------+----------
    1 |    1 | t     | t
    1 |    2 | f     | f
 NULL |    1 | NULL  | f
 NULL | NULL | NULL  | t

This is the result we want but it is very long. Is there a better
way?

PostgreSQL implements the SQL standard for safely comparing nullable
fields:

haki=# SELECT
    a,
    b,
    a = b as equal,
    a IS DISTINCT FROM b AS is_distinct_from
FROM
    t;

  a   |  b   | equal | is_distinct_from
------+------+-------+------------------
    1 |    1 | t     | f
    1 |    2 | f     | t
 NULL |    1 | NULL  | t
 NULL | NULL | NULL  | f

PostgreSQL wiki explain IS DISTINCT FROM:

    IS DISTINCT FROM and IS NOT DISTINCT FROM ... treat NULL as if it
    was a known value, rather than a special case for unknown.

Much better - short and verbose.

How Other Databases Handle This?

  * MySQL - A special operator <=> with similar functionality.
  * Oracle - Provides a function called LNNVL to compare nullable
    fields (good luck with that...).
  * MSSQL - Couldn't find a similar function.

---------------------------------------------------------------------

ARRAY_AGG (DISTINCT)

ARRAY_AGG was one of the major selling points of PostgreSQL when I
was transitioning from Oracle.

ARRAY_AGG aggregates values into an array:

haki=# SELECT
    department,
    ARRAY_AGG(name) AS employees
FROM
    employee
GROUP BY
    department;

    department        |        employees
----------------------+-------------------------------------
Services              | {"Diana Fisher","Jessica Elliott"}
Support               | {"Bonnie Robertson"}
Training              | {"Jean Bailey"}
Accounting            | {"Phillip Reynolds","Sean Franklin"}
Business Development  | {"Benjamin Brown","Brian Hayes"}
Marketing             | {"Richard Bradley","Arthur Moreno"}
Product Management    | {"Richard Fox","Randy Wells"}
Human Resources       | {"Janet Hall"}
Engineering           | {"Carl Frazier","Carolyn Carter"}
Sales                 | {"Benjamin Carter"}
Research and Develo.. | {"Donna Reynolds","Ann Boyd"}
Legal                 | {"James Cunningham","George Hanson"}

I find ARRAY_AGG useful mostly in the CLI for getting a quick view of
the data, or when used with an ORM.

PostgreSQL took it the extra mile and implemented the DISTINCT option
for this aggregate function as well. Using DISTINCT we can, for
example, quickly view the unique salaries in each department:

haki=# SELECT
    department,
    ARRAY_AGG(DISTINCT salary) AS salaries
FROM
    employee
GROUP BY
    department;

department                | salaries
--------------------------+---------------
 Accounting               | {11203}
 Business Development     | {2196,7386}
 Engineering              | {1542,3052}
 Human Resources          | {2826}
 Legal                    | {1079,3706}
 Marketing                | {5740}
 Product Management       | {9101,13449}
 Research and Development | {6451,10628}
 Sales                    | {6197}
 Services                 | {2119}
 Support                  | {12674}
 Training                 | {13230}

We can immediately see that everyone in the support department are
making the same salary.

How Other Databases Handle This?

  * MySQL - Has a similar function called GROUP_CONCAT.
  * Oracle - Has an aggregate function called ListAgg. It has no
    support for DISTINCT. Oracle introduced the function in version
    11.2 and up until then the world wide web was filled with custom
    implementations.
  * MsSQL - The closest I found was a function called STUFF that
    accepts an expression.

---------------------------------------------------------------------

Take away

The take away from this article is that you should always go back to
the basics!

---------------------------------------------------------------------
[subscribe]

Want me to send you an email
when I publish something new?

[                    ] [YES!]
---------------------------------------------------------------------

Share to show you care

  *  
  *  
  *  
  *  
  *  

---------------------------------------------------------------------

Similar articles

  * 
      + 13 August 2019
      + PostgreSQL, SQL

    How to Get the First or Last Value in a Group Using Group By in
    SQL

    A neat little trick using arrays in PostgreSQL

  * 
      + 26 April 2021
      + PostgreSQL, SQL

    Practical SQL for Data Analysis

    What you can do without Pandas

  * 
      + 06 October 2022
      + SQL, PostgreSQL

    Future Proofing SQL with Carefully Placed Errors

    How to fail loudly when you really should

  * 
      + 17 September 2018
      + PostgreSQL, SQL, Performance

    Be Careful With CTE in PostgreSQL

    How to avoid common pitfalls with common table expressions in
    PostgreSQL

  * 
      + 22 December 2018
      + PostgreSQL, SQL, Performance

    How We Solved a Storage Problem in PostgreSQL Without Adding a
    Single Byte of Storage

    A short story about a storage-heavy query and the silver bullet
    that solved the issue

  * (c) Haki Benita 2016-2022
  * Credits
  * Appreciate